Social and emotional skills are important for adaptive functioning in everyday life (Soto et al., 2020). Clinical researchers have developed an array of psychological assessments to measure these skills (e.g., Abrahams et al., 2019) and to explain difficulties in social responsiveness or behavior observed in various psychological conditions including autism spectrum disorder (ASD; Morrison et al., 2019) and schizophrenia (Pinkham et al., 2018). Cognitive neuroscience researchers have also used these tools to identify brain areas associated with social functioning (Schaafsma et al., 2015; Schurz et al., 2021). Due in part to the COVID-19 pandemic, however, many clinical and research practices have needed to shift from administering in-person assessments to using measures that can be administered remotely during telemedicine visits or online studies (Türközer & Öngür, 2020). Online, non-proctored assessment had already become the predominant mode in employment testing prior to the pandemic (e.g., Tippins, 2015). In contrast, many of the assessments used in clinical or developmental research are designed for in-person, proctored administration. Few studies to date have documented the psychometric properties of social intelligence measures in clinical and typically developing samples (e.g., Gourlay et al., 2020; Pinkham et al., 2018), and there has been little research to determine whether any of these instruments are suitable for remote, online administration.

Although recent work has helped establish new, web-based cognitive ability assessments (e.g., Biagianti et al., 2019; Liu et al., 2020; Wright, 2020), few studies have focused on designing remote measures of social intelligence. This is important given that social intelligence is generally affected by ASD or similar developmental disorders (Velikonja et al., 2019). Presently, many of the social intelligence measures that can be administered online rely on self- or observer-reports such as the Social Responsiveness Scale (SRS; Constantino et al., 2003), the Autism-Spectrum Quotient (AQ; Baron-Cohen et al., 2001b), or the Broad Autism Phenotype Questionnaire (BAPQ; Hurley et al., 2007). In contrast, many of the validated, performance-based social intelligence tests are designed to be completed in-person and administered by a proctor or trained clinician. This makes them ill-suited for remote administration and presents a challenge given the growing need for remote, web-based assessments. To this end, the Social Shapes Test (SST; Brown et al., 2019) was developed as a simple, self-administered social intelligence test based on the animated shape task created by Heider and Simmel (1944). To date, however, the SST has only been validated for use with adults without ASD. Therefore, we conducted the present study to examine whether the SST is appropriate for use as a remote, performance-based social intelligence test for adults with ASD.

We consider the SST, along with other existing animated shape tasks, to be measures of social intelligence (SI). We define SI as the ability to perceive and decode the internal states, motives, and behaviors of others (Mayer & Salovey, 1993; Lievens & Chan, 2010). This operational definition overlaps with those for constructs commonly studied in autism research like mentalizing and Theory of Mind (ToM; Luyten et al., 2020). Some scholars have recently expressed concern regarding the accumulation of narrowly defined social and emotional abilities and the potential for jingle and jangle fallacies (Olderbak & Wilhelm, 2020; Quesque & Rossetti, 2020). An example of this concern is that a task in which individuals are asked to identify mental states from pictures of human faces (e.g., the Reading the Mind in the Eyes test; Baron-Cohen et al., 2001a) has been variously characterized as a measure of Theory of Mind, mentalizing ability, empathic accuracy, face processing, and emotion recognition across different studies (Oakley et al., 2016). Therefore, we use the more inclusive term SI given its long history in psychological research and its broader use across research fields relative to other terms (e.g., Theory of Mind, which is more specific to developmental research, or mentalizing, which is more specific to social cognitive neuroscience).

Measuring Social Intelligence Using Animated Shape Tasks

The original animated shape task was developed by Heider and Simmel (1944), who famously observed that research participants often described the movements of simple animated, geometric shapes in human psychological terms. This pioneering work inspired several streams of research where scholars sought to identify individual differences in Theory of Mind or mentalizing ability using the original film or newly created shape animations. Klin (2000) used the original Heider and Simmel film to create the Social Attribution Task (SAT). In this task, individuals were shown the film and asked to provide written responses to 17 questions about the events in the film (e.g., “What happened to the big triangle?”). Each question was asked after participants viewed specific segments of the film. These responses were scored by human raters based on the use of specific kinds of terms indicating concepts such as emotions, mental states, or behaviors. Klin reported that individuals with autism or Asperger’s disorder made fewer social attributions compared to individuals without ASD, as indicated by using fewer mental or emotional state terms, mentioning fewer personality features of the shapes, and having difficulty identifying the social meaning of the shapes’ movements. Likewise, SAT scores have been found to predict the severity of ASD-related social symptoms in a sample of children with ASD but average general intelligence (Altschuler et al., 2018). Researchers have observed modest test-retest reliability for SAT scores. Most recently, Altschuler and Faja (2022) reported stronger reliability for spontaneous ToM and cognitive ToM scores but slightly weaker reliability for affective ToM scores. A modified version of the SAT has also been used in neuroimaging research to identify differences in activation of brain regions related to social information processing between individuals with and without ASD (e.g., Vandewouw et al., 2021).

The Frith-Happé animation task is similar to the Social Attribution Task but consists of 12 short films, each featuring two animated triangles (Abell et al., 2000). In each film, the movements of the triangles are meant to depict interactions involving mental states, purely physical goal-directed interaction, or purposeless movement. A recent meta-analysis of studies using the Frith-Happé animations (k = 33 studies) found that individuals with ASD are less able to correctly categorize animations designed to depict mentalizing compared to animations containing only goal-directed or random movement (Wilson, 2021). In addition, the Frith-Happé animations have been used to identify similar difficulties in social attribution among adult patients with schizophrenia (Martinez et al., 2019).

Although most studies have focused on mental state attributions from written responses to these animated shape stimuli, some scholars have adapted these tasks into a multiple-choice test format. A 19-item, multiple-choice version of the Social Attribution Task (SAT-MC) was designed by Bell and colleagues (2010). This test uses the same film as the SAT but replaces the narrative responses with targeted multiple-choice questions which are scored as either correct or incorrect. Performance on the SAT-MC has been found to be positively related to performance on other social cognition tasks including the Bell-Lysaker Emotion Recognition Task and the Mayer-Salovey-Caruso Emotional Intelligence Test (Bell et al., 2010). Adults with schizophrenia have also been found to perform significantly worse on the SAT-MC compared to a group of healthy controls (Johannesen et al., 2018; Pinkham et al., 2018). SAT-MC scores were also found to be positively related to social skills as assessed by a standardized role-playing task. In addition, the test has displayed promising validity in autism research, where a recent pilot study by Burger-Caplan et al. (2016) found that children with an ASD diagnosis scored 0.87 standard deviations lower on the test compared to healthy controls.

Similar to the SAT-MC, White and colleagues (2011) designed a multiple-choice task using the Frith-Happé animations. In this version of the task, individuals are asked to correctly categorize each film as demonstrating either theory of mind, physical interaction, or random movement. Performance on this task has been found to correlate positively with performance on other social intelligence tasks while also displaying modest group score differences favoring IQ-matched, typically developing adults (Brewer et al., 2017). This task has also been administered online to adults with and without ASD diagnoses in recent research (Livingston et al., 2021). However, the multiple-choice version of the Frith-Happé animation task has only been used in seven of the 33 studies identified by Wilson (2021).

One benefit of these various animated shape tasks is that they rely less on reading skill or verbal knowledge and comprehension compared to other measures of SI. For example, tasks like the Faux Pas task (Baron-Cohen et al., 1999) or the Hinting task (Corcoran et al., 1995) require reading and interpreting written descriptions of social interactions. Other tasks like the Reading the Mind in the Eyes Test (RMET; Baron-Cohen et al., 2001a) require knowledge of words used to describe emotional or mental states which are not commonly used in everyday language (Kittel et al., 2022; Peterson & Miller, 2012). These tasks may confound social intelligence with verbal ability, where some individuals could use their verbal skills to compensate for low SI (Livingston & Happé, 2017). Another advantage of animated shape tasks is that they are abstract and do not include any obvious cultural or gender cues. Such cues, like those present in emotion recognition tasks which only use faces of White or Caucasian individuals, can result in mean test score differences due to race or ethnicity among clinical and nonclinical populations (Dodell-Feder et al., 2020; Pinkham et al., 2017). In contrast, animated shape tasks have displayed little if any racial or ethnic group differences in past research, which makes them potentially suitable for studies involving international samples (Brown et al., 2019, 2022; Lee et al., 2018). However, several of the existing animated shape tasks are not well-suited for remote testing. For example, the original SAT and Frith-Happé animations require a clinician or administrator to ask questions and to record and score verbal responses from participants. Not only could this introduce confounding effects of verbal ability or rater bias, but it also increases administration time and financial costs (Livingston et al., 2019).
Although more recent versions of these tasks use a fixed set of multiple-choice questions (e.g., the SAT-MC), an administrator is still needed to play the specific video segment for each question. This prevents participants from completing these tasks remotely, which likely deters some researchers from using them as studies increasingly shift from in-person to online administration. Other alternative versions of these tasks are primarily designed for neuroimaging studies and are also not well-suited for brief, online assessment (Ludwig et al., 2020).

The Social Shapes Test (SST)

The SST is a 23-item multiple-choice test designed to measure individual differences in social intelligence among neurotypical adults. Each SST item consists of a short, 13–23 s animated video which includes a standard set of colored, geometric shapes. Each video features a different social plot where the shapes display a variety of behaviors including bullying, helping, comforting, deceiving, and playing. Some animations were designed to mimic the bullying behavior which appears in the original Heider and Simmel video. Others were designed to represent false belief tasks. These animations have been found to elicit a degree of social attribution in written descriptions similar to that reported by Klin (2000): Ratajska et al. (2020) scored narrative descriptions of each of the SST videos using Klin’s Theory of Mind indices and found that the range of scores for SST items overlapped with those reported for the original Heider and Simmel film. All videos are controlled by the participant and can be viewed as many times as desired. Before starting the SST, participants are given the following instructions:

“In this task, you will see a series of short, silent, animated videos. The shapes in these videos can be interpreted as people interacting with each other.

First, please watch each video carefully and completely. After watching the video, select the best answer to the multiple choice question listed below the video. Make sure to answer all of the questions to the best of your ability.

Next is a practice item. Please watch the video and try your best to answer the question. Note that you are allowed to replay a video as many times as you want while answering the question. Please do not expand the videos to full screen.”

Next, all participants are given a sample item followed by feedback indicating the correct response (Fig. 1). All 23 items are subsequently administered in the same order for all participants.

Fig. 1

Practice SST Item. All participants were given a practice item (A) before starting the 23-item SST. After responding to the practice item, participants received feedback which identified the correct response (B)

Unlike other SI tasks, the SST was explicitly designed to be completely self-administered online, as was done in initial validation studies (Brown et al., 2019). All questions are scored using an objective scoring key, which helps prevent potential rater bias when scoring the open responses used in other animated shape tasks (White et al., 2011). Like the SST, an updated version of the Frith-Happé animations was developed for remote, online administration (Livingston et al., 2021). Although versions of the Frith-Happé animations have been found to detect differences in social intelligence between neurotypical adults and adults with ASD, they have rarely been administered to large samples of typically developing adults. Lastly, all SST questions and video files are freely available for research use and can be accessed via the Open Science Framework. Researchers are free to use the SST videos to administer the test as part of an online survey or to adapt the videos in order to suit their own individual studies. This makes the SST more easily accessible for researchers, especially compared to other video-based social intelligence tests owned or distributed by commercial test publishers (e.g., The Awareness of Social Inference Test – TASIT; McDonald, 2012). The video content in the SST is also relatively short (each animation ranges between 13 and 23 s in length), which helps minimize administration time compared to other video-based measures of social intelligence (e.g., the Movie for the Assessment of Social Cognition; Dziobek et al., 2006).

The SST is also unique in that it was originally developed and validated using samples of undergraduate college students and crowdsourced participants from Amazon Mechanical Turk (MTurk) who were not selected for a prior history or diagnosis of ASD. In these studies, the SST has demonstrated modest internal consistency (α > 0.65) and promising convergent validity with other performance measures of social intelligence. Among MTurk workers, SST scores were found to be positively related to emotion recognition ability as assessed by the RMET (r = .47). Individuals who scored higher on the SST were also more effective at identifying the correct emotion or mental state based on written scenarios in the Situational Test of Emotional Understanding (r = .48; Brown et al., 2019). In a subsequent study of undergraduate psychology students, those who scored higher on the SST were better at identifying the best behavioral solutions to interpersonal workplace situations in a situational judgment task (r = .40; Brown et al., 2022). These relationships remained even after controlling for differences in more general cognitive abilities (e.g., verbal or spatial abilities) and educational attainment. Despite these promising results, however, it is uncertain whether the SST can adequately assess differences in social intelligence among adults with ASD or other developmental disorders.

Present Study

We designed the present study to investigate whether the SST is suitable for remote self-administration as a measure of social intelligence among adults with ASD. Our first aim is to test for measurement invariance of the SST between adults with and without ASD. Our second aim is to collect further validity evidence for the SST by testing whether unaffected, typically developing adults score higher on the test compared to adults who have been diagnosed with ASD. Based on similarities in test content (e.g., use of similar geometric shape animations) and existing convergent validity evidence from typically developing adult samples, we expect that the SST and other animated shape tasks measure a similar underlying social intelligence construct. Therefore, we expect adults with ASD to score lower on the SST compared to adults without ASD, as observed in prior research using similar animated shape tasks like the Frith-Happé animations or the SAT-MC (Burger-Caplan et al., 2016; Livingston et al., 2021; Wilson, 2021). We also conducted a second study to gather further reliability and validity evidence for two alternate 14-item forms of the SST and to compare performance on the SST with scores on an existing animated shape task (the Frith-Happé animation task; White et al., 2011).

Study 1

Participants

Participants in Study 1 included adults with and without a prior diagnosis of autism spectrum disorder (ASD). We recruited 261 participants who self-reported a diagnosis of ASD, autistic disorder, or Asperger’s disorder from the Simons Foundation Powering Autism Research for Knowledge (SPARK; SPARK Consortium, 2018). This cohort consists of individuals with ASD and their first-degree relatives. All individuals recruited for this study live independently and had no record of cognitive impairment when they joined the SPARK cohort. A broader description of adults in the SPARK cohort was recently reported by Fombonne et al. (2020). All SPARK participants were given a $10 Amazon gift card for completing the study. Although diagnosis history was collected using either self- or parent-reports, rather than direct clinical evaluation, past research has found that this method yields reliable accounts of autism diagnoses in other research registries (Daniels et al., 2012).

To account for the lack of clinical data for the independent adult ASD sample recruited via SPARK, we also recruited a second sample of 25 adults who had previously received a clinical diagnosis of ASD from a neurodevelopmental clinic. All 25 individuals had sought clinical services in the Northeastern U.S. and had consented to be contacted for ongoing research studies. Due to the smaller pool of eligible participants compared to the SPARK cohort, these participants were given a larger reward of a $35 Amazon gift card for completing this study. All participants were recruited online and completed the SST without a proctor or administrator. Participants ranged from 18 to 34 years of age (mean age = 20.8, SD = 3.9). Most participants identified as male (20/25; 80%) and as White, non-Hispanic (22/25; 88%). Based on assessment scores obtained from the electronic medical record, the average full-scale IQ score in the ASD group was 86.1 (SD = 22.4). T-scores from the Social Responsiveness Scale (SRS; Constantino et al., 2003) indicated an elevated level of autistic symptoms among most of the participants in the clinical ASD group as well (M = 74.2, SD = 12.4).

We also recruited adults without ASD for this study. One group of adults without ASD was recruited from SPARK; these individuals were parents of one or more children with an ASD diagnosis who themselves had never received a diagnosis (SPARK parent; n = 217). Although these adults did not report any history of ASD, they may be at a greater genetic risk for ASD compared to the general population. Therefore, we also relied on data collected from adult participants in two prior studies (Brown et al., 2019, 2022) for a comparison group of adults without ASD. Unlike the SPARK parents, we assume that the adult participants from prior studies were not likely to share a potential genetic predisposition to ASD or to other developmental disorders given the relatively low rate of ASD in the general population. There was also no focus on ASD or developmental disorders in either of the prior studies from which these participants were recruited. A total of 829 participants were recruited from undergraduate psychology courses at a public university in the Midwestern U.S. and from Amazon’s Mechanical Turk. Most of these participants identified as female (59%) and White, non-Hispanic (56%). All of these adult participants had completed the SST as part of a self-administered, online Qualtrics survey.

Data Cleaning

Prior to data analysis, we removed participants who had a median response time of less than 10 s per SST item. This response time threshold was chosen to increase the likelihood that participants watched the entire video for each item and to remove potential cases of non-purposeful responding (all 23 SST videos were 13 s in length or longer). We observed that a greater proportion of the adults without ASD were removed based on our response time threshold (21%) compared to SPARK participants with ASD (3%). A total of five participants were removed from the clinical sample. We also removed participants who failed to respond correctly to all four attention-check items. In each such item, participants were asked to watch a different shape animation and to identify which of four shapes did not appear in the video. The attention-check items depend only on basic cognitive processes (e.g., vision, attention, memory) and should not require social intelligence to solve. More than 90% of participants with or without ASD were able to correctly identify the missing shape in all four items. This left us with a total sample of n = 1,275 participants (ASD n = 229; SPARK parent n = 217; without ASD n = 829). We provide a full summary of the key demographic variables for each group in Table 1.
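The exclusion rules above can be summarized in a short sketch. The function name and data layout here are hypothetical, and the original screening was performed in R rather than Python, so this is purely illustrative:

```python
# Sketch of the Study 1 exclusion rules (hypothetical data layout).
# A participant is retained only if (a) their median response time across
# SST items is at least 10 s (all videos are 13 s or longer, so faster
# medians suggest videos were not watched) and (b) all four shape-based
# attention-check items were answered correctly.
from statistics import median

def keep_participant(item_rts, attention_correct, min_median_rt=10.0):
    """item_rts: per-item response times in seconds;
    attention_correct: four booleans, one per attention-check item."""
    return median(item_rts) >= min_median_rt and all(attention_correct)

# A careful responder is retained; a too-fast responder is dropped.
print(keep_participant([14.2, 18.0, 13.5, 21.1], [True] * 4))  # True
print(keep_participant([4.0, 5.2, 3.1, 6.4], [True] * 4))      # False
```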

Table 1 Study 1 Participant Demographics

Procedure

All study materials were presented in an online survey which was accessed via a link sent using email. Participants were given a brief set of instructions and a practice item before beginning the SST. Afterwards, participants completed several demographic items regarding their geographical location, educational attainment, self-identified race/ethnicity, and approximate annual income. Participant sex and age for participants recruited via SPARK was provided by the SPARK consortium. All SST scores were calculated as the simple sum of correct responses across the 23 items.

Statistical Analysis

All analyses were performed using R version 3.6.3. We tested for measurement invariance using confirmatory factor analysis models estimated with the lavaan package (Rosseel, 2012). We also used multiple linear regression to statistically control for demographic differences between the three diagnosis groups. We report the standardized mean difference (Cohen’s d) when interpreting differences in SST scores by ASD group, where negative values indicate lower scores compared to adults without ASD.
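Because the group comparisons below are interpreted with Cohen's d, a minimal sketch of the pooled-SD formula and its sign convention may be useful. This is pure Python for illustration with invented scores; the actual analyses were run in R:

```python
import math

def cohens_d(group, reference):
    """Standardized mean difference using the pooled standard deviation.
    Negative values mean `group` scored lower than `reference`
    (in this paper, the reference is adults without ASD)."""
    n1, n2 = len(group), len(reference)
    m1, m2 = sum(group) / n1, sum(reference) / n2
    v1 = sum((x - m1) ** 2 for x in group) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in reference) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Toy example: the first group averages about 1.5 pooled SDs lower.
print(round(cohens_d([10, 12, 11, 9], [12, 14, 13, 11]), 2))  # -1.55
```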

Results

In order to address our first aim and determine whether the SST functions similarly for adults with or without ASD, we tested for measurement invariance between participants with and without ASD. It is important to establish measurement invariance in order to determine whether observed score differences between groups reflect true differences in the construct of interest and not differences in the test’s measurement properties (Vandenberg & Lance, 2000). We focused on metric invariance, which tests whether the primary factor loadings for test items are equal across groups. To do so, we first specified a single-factor confirmatory factor analysis model in which all 23 SST items load on a single factor. Factor loadings were estimated for all adults without ASD (participants from the prior studies and SPARK parent groups combined). Next, this model was estimated using response data from the ASD group while constraining each item loading to equal the estimate from the group without a prior diagnosis. Constraining these factor loadings to be equivalent did not significantly reduce overall model fit, ∆χ2 (22) = 30.05, p = .12. These results were further supported by comparable estimates of internal consistency for participants with ASD (α = 0.72) and without ASD (α = 0.67). Item difficulties (percent correct) were also highly consistent between groups (r = .97, p < .001), indicating that the items which were most difficult for adults with ASD were also most difficult for adults without ASD. Based on these results, we conclude that the SST provides an equivalent measure of social intelligence for adults regardless of ASD diagnosis. We report the psychometric properties and descriptive statistics for the SST within each group in Table 2.
Therefore, any subsequent differences in SST scores between groups can be attributed to differences in social intelligence ability and not differences in test functioning between the two groups.
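As an arithmetic check on the chi-square difference test reported above, the p-value for ∆χ2(22) = 30.05 can be reproduced from the chi-square survival function, which has a closed form for even degrees of freedom. This Python sketch only verifies the reported p-value; the invariance models themselves were fit with lavaan in R:

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function P(X > x), closed form for even df:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    assert df % 2 == 0, "closed form shown here requires even df"
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= (x / 2) / (i + 1)  # next term (x/2)^(i+1) / (i+1)!
    return math.exp(-x / 2) * total

# Chi-square change of 30.05 on 22 df from constraining the loadings:
p = chi2_sf(30.05, 22)
print(round(p, 2))  # 0.12, i.e., no significant loss of model fit
```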

For our second aim, we tested for differences in SST scores between participants with and without ASD using linear regression. Given the heritability of ASD, we also tested for differences between adults recruited for prior SST studies and parents of an affected child (SPARK parents). We first regressed SST scores on dummy-coded diagnosis variables representing adults with ASD and SPARK parents (Model 1 in Table 3). A statistically significant regression coefficient for either dummy-coded variable indicates a meaningful difference in SST scores compared with adults without ASD from prior studies. Participants with ASD did score significantly lower on the SST relative to adults without ASD from prior studies (β = –0.08, p = .006, d = –0.21, 95% CI = [–0.35, –0.06]). We provide a histogram illustrating this difference in SST scores in Fig. 2. SPARK parents also scored lower on the SST compared to adults without ASD, but this difference was not statistically significant (β = –0.03, p = .22, d = –0.09, 95% CI = [–0.25, 0.06]). However, we observed several differences in the demographic makeup of our three comparison groups which may affect these observed test scores (Table 1). In particular, SPARK parents reported greater educational attainment and were older than the other two groups on average. Adults without ASD from prior studies were younger on average, reported greater educational attainment compared to adults with ASD, and were less likely to identify as White, non-Hispanic. Therefore, we also tested for differences in SST scores after statistically controlling for participant age, educational attainment, and race/ethnicity. Participants with ASD still scored lower on the SST after controlling for these demographic differences (β = –0.10, p < .001; d = –0.27). In addition, we observed a significant difference between adults without ASD and SPARK parents when holding age, race/ethnicity, and education constant (β = –0.07, p < .001; d = –0.20).
Among the demographic control variables, educational attainment was positively related to SST scores (β = 0.15, p < .001), and participants who identified as White scored higher on the SST compared to all others (β = 0.10, p < .001). Age was not a significant predictor of SST scores in this regression model.
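The dummy-coding logic behind Model 1 can be illustrated with a toy example: with an intercept plus one dummy variable per non-reference group, the OLS intercept equals the reference-group (no-ASD) mean and each coefficient equals that group's mean difference from the reference. The data and group sizes below are invented for illustration, and the computation is done in pure Python; the paper's models were fit in R:

```python
# Toy demonstration that dummy-coded OLS coefficients are group mean
# differences from the reference group.

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Groups: reference (no ASD; both dummies 0), ASD (d1 = 1), SPARK parent (d2 = 1).
scores = [15, 17, 16, 12, 14, 13, 14, 16, 15]
d1 =     [0,  0,  0,  1,  1,  1,  0,  0,  0]
d2 =     [0,  0,  0,  0,  0,  0,  1,  1,  1]
X = [[1, a, b] for a, b in zip(d1, d2)]
intercept, b_asd, b_parent = ols(X, scores)
# Intercept = reference mean; coefficients = mean differences from reference.
print(round(intercept, 1), round(b_asd, 1), round(b_parent, 1))  # 16.0 -3.0 -1.0
```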

Table 2 SST Descriptive Statistics by Diagnosis Group in Study 1
Fig. 2

Relative distribution of SST Scores for participants with ASD (n = 229) and without ASD (n = 829). Adults with ASD are displayed in red. Adults without ASD are displayed in blue. The y-axis represents the proportion of participants within each group. The x-axis represents the number of correct responses to the 23-item SST.

Fig. 3

Correlation between alternate 14-item SST forms in Study 2

Table 3 Differences in SST Scores between Diagnosis Groups in Study 1

Study 2

Although the 23-item SST typically takes roughly 15 min to complete, this may be too long for some studies involving a series of different assessments and measures. Prior studies have developed shorter versions of commonly used SI tests (e.g., the short-form Reading the Mind in the Eyes Test; Olderbak et al., 2015) to accommodate researchers who wish to include SI performance measures without making the length of the study discouraging or prohibitive to prospective participants. Therefore, we conducted Study 2 to develop an abbreviated form of the SST which could be used when a shorter administration time is needed while retaining the psychometric properties of the full 23-item version. We also conducted this study to estimate the test-retest reliability of this shorter version of the SST. The 23-item version has displayed only modest internal consistency, with estimates ranging from α = 0.67 (Brown et al., 2019) to 0.72 among adults with ASD in Study 1. However, some scholars have argued that internal consistency underestimates the reliability of tests when item content is heterogeneous (Neubauer & Hofer, 2022). We therefore estimated the reliability of the SST as the test-retest correlation between two alternate forms, as demonstrated for the SAT-MC by Pinkham et al. (2017).
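Alternate-forms reliability of the kind estimated here is simply the Pearson correlation between scores on the two forms. A minimal sketch with invented scores follows (the study's analyses were conducted in R):

```python
import math

def pearson_r(x, y):
    """Pearson correlation; applied to scores from two alternate test
    forms, this is the alternate-forms reliability estimate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: Form A and Form B scores for five participants.
form_a = [10, 12, 9, 13, 11]
form_b = [11, 12, 10, 14, 10]
print(round(pearson_r(form_a, form_b), 2))  # 0.85
```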

A second goal of Study 2 is to gather further validation evidence for the SST by examining convergent validity with another animated shape task. As noted earlier, the Frith-Happé animation task was recently evaluated for remote, online use in research studies (Livingston et al., 2021). Although this task uses shape animations similar to those featured in the SST, there are some differences between the two measures. A prior version of the Frith-Happé task included questions where participants needed to identify mental or emotional states of specific shapes (White et al., 2011), but most of the research using the tool has administered only questions where participants categorize the content of each video. In contrast, the SST features more focal shapes in each video (four or five shapes compared to only two in the Frith-Happé task) and all SST videos involve social interactions. This potentially allows for greater granularity in measuring differences in social intelligence relative to the four Theory of Mind items in the Frith-Happé task. We include both measures in Study 2 in order to estimate the correlation between the two tasks and to observe whether each task can predict incremental variance in performance on a separate, video-based social intelligence test after controlling for individual differences in general intelligence.


Participants

We gathered a sample of 387 U.S.-based adults from Amazon’s Mechanical Turk using CloudResearch (formerly TurkPrime; Litman et al., 2017). Participants were paid $5.50 for completing the first survey and $7.00 for completing a second survey one week later. Among the initial 504 participants who completed the first survey, we removed 24 participants who either failed the attention-check items on the SST forms or did not meet our median response time criterion. Of the remaining 480 participants, 387 returned one week later to complete the second survey (81%). Most participants identified as White (80%) and male (56%). The average participant age was 41.63 years old (SD = 12.08). We report a full summary of participant demographics in Table 4. When comparing participants who did or did not complete the second, follow-up survey, we found no differences in self-reported gender (χ2(1) = 3.66, p = .06) or educational attainment (t = 0.69, p = .49). Participants who completed both surveys were slightly older (d = 0.24, t = 2.29, p = .02) and were more likely to identify as White (80% versus 72%; χ2(1) = 4.45, p = .03) than those who completed only the first survey.

Table 4 Study 2 Demographics

Procedure
All measures were administered using an online survey hosted by Qualtrics. All tasks were completed by participants in the same fixed order. Participants were also recruited to complete an alternate SST form one week after completing the initial form. On average, participants completed the second survey six days after the first administration (ranging from 6 to 10 days between administrations). After the alternate SST form, participants completed the four Theory of Mind Frith-Happé animations along with the eight feelings questions reported by White et al. (2011), the Social Norm Questionnaire (Kramer et al., 2014), and an 18-item situational judgment test of interpersonal skills.

Measures
We created two 14-item SST forms based on an item analysis of the full 23-item version using item-level data reported by Brown et al. (2019) and supplemental data which was not featured in the published article but is publicly available on the Open Science Framework. Several newly written items were created based on existing animation files and were initially evaluated as part of a separate study. Each form featured the same 14 shape animation files but paired each with a different multiple-choice question. Each form also included a single attention check item from the original 23-item version. All participants were randomly assigned to complete either Form A or Form B in the first survey and completed the alternate form when participating in the second survey.

After completing the SST in the first survey, participants completed the 12-item Frith-Happé animation task (Livingstone et al., 2021; White et al., 2011). In this task, participants viewed short film clips featuring two animated triangles and were asked to categorize each film as demonstrating random movement, physical or goal-directed movement, or mentalizing. Lastly, participants completed the 16-item ICAR cognitive ability test (α = 0.77; Revelle et al., 2020).

In the second survey, participants completed the alternate SST form and the objective, eight-item Frith-Happé feelings task (White et al., 2011). To assess knowledge of social norms, we next administered the 22-item Social Norm Questionnaire (SNQ; Kramer et al., 2014) and an 18-item situational judgment test. The SNQ measures knowledge of social norms (α = 0.68) and has been observed to correlate positively with other social intelligence ability tests in past research (Baksh et al., 2021). The situational judgment test (SJT) was designed to evaluate understanding of effective behavior in interpersonal interactions in a variety of everyday settings. Each item presents a short, written scenario about a social interaction. We asked participants to identify the most effective and least effective responses to each scenario among five behavioral response options (α = 0.72). This methodology is widely used in research and practice to assess interpersonal skills in adults and children (Murano et al., 2020; Webster et al., 2020).

Results
We tested for practice effects and differences in difficulty between the two alternate forms using repeated measures ANOVA. There was no evidence for a practice effect between SST administrations in the first and second surveys, F(1,385) = 2.07, p = .15. Form A was significantly more difficult than Form B, F(1,385) = 158.20, p < .001, Cohen’s d = 0.55, and this difference in scores between forms was consistent regardless of the order in which the forms were completed. Although participants provided more correct responses to Form B than to Form A, we found modest test-retest reliability between the alternate forms (r = .61, p < .001; ICC = 0.52, 95% CI = [0.25, 0.68]; Fig. 3). The test-retest correlation did not vary based on the order in which the alternate forms were completed. We also found similar estimates of internal consistency for each form (Form A α = 0.65; Form B α = 0.64). There were no statistically significant score differences based on participant age, gender, or race/ethnicity for either SST form. These results support the test-retest reliability of the SST across alternate forms and indicate that each shortened form has internal consistency comparable to the original 23-item version in Study 1.
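The reliability and effect-size statistics above follow standard textbook formulas. As an illustrative sketch only (not the scoring code used in this study, and using made-up toy data), Cronbach's alpha and a pooled-SD Cohen's d can be computed in a few lines of pure Python:

```python
from statistics import mean, pvariance, stdev

def cronbach_alpha(items):
    """Cronbach's alpha: alpha = k/(k-1) * (1 - sum of item variances / total variance).
    `items` is a list of item-score lists, respondents aligned by index;
    population variances are used throughout."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

def cohens_d(a, b):
    """Cohen's d for two independent groups using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled
```

With real data, each inner list passed to `cronbach_alpha` would hold one item's scores across all respondents; the ANOVA-based form comparison reported above additionally accounts for the repeated-measures design.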

Next, we correlated SST scores with performance on the Frith-Happé animation task. All correlations between study tasks are reported in Table 5. We first calculated separate categorization scores for the Theory of Mind (ToM), goal-directed (GD), and random video trials (e.g., Livingstone et al., 2021; White et al., 2011). Among these three subscales, only the random videos provided adequate internal consistency (α = 0.64). In contrast, internal consistency was very weak for the goal-directed (α = 0.33) and ToM videos (α = 0.09). Two of the ToM videos were incorrectly categorized as representing a physical interaction by a majority of participants (“Coaxing” = 85% of participants, “Seducing” = 70% of participants). The corrected item-total correlations for the four ToM videos were also very weak, ranging between –0.19 and 0.14. Likewise, two of the goal-directed videos were also incorrectly categorized as representing a mental interaction (ToM) by most participants (“Chase” = 59% of participants, “Leading” = 62% of participants). Due to these weak reliability estimates, we use overall performance scores on the Frith-Happé categorization items in our regression analyses (12-item α = 0.48). Despite these poor measurement properties, SST scores were positively correlated with categorization of all Frith-Happé videos (rForm A = 0.36, 95% CI = [0.27, 0.44], p < .001, rForm B = 0.42, 95% CI = [0.34, 0.50], p < .001).
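The corrected item-total correlations reported here correlate each item with the sum of the remaining items, so that an item is not correlated with a total that includes itself. A minimal pure-Python sketch of that computation (illustrative only, not the analysis code used in the study):

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def corrected_item_total(items):
    """Corrected item-total correlation for each item: the item's scores
    correlated with the total of all OTHER items (item removed from the total)."""
    totals = [sum(scores) for scores in zip(*items)]
    return [pearson_r(item, [t - i for t, i in zip(totals, item)])
            for item in items]
```

Negative or near-zero values, like those observed for the ToM categorization videos, indicate items that do not discriminate in the same direction as the rest of the scale.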

Table 5 Study 2 Correlation Matrix

In contrast to the categorization items, we observed stronger measurement properties for the Frith-Happé feelings items. For each of the four ToM animations, participants were asked to identify the correct mental state of the small and large triangle from five response options (see White et al., 2011 for the individual questions). These eight items displayed better internal consistency (α = 0.61) compared to the categorization task and had corrected item-total correlations ranging between 0.12 and 0.45. Performance on this task was positively related to scores on SST Forms A (r = .48, 95% CI = [0.40, 0.55], p < .001) and B (r = .44, 95% CI = [0.35, 0.51], p < .001). These correlations were stronger than the observed correlation between overall performance on the Frith-Happé categorization items and the feelings items (r = .36, 95% CI = [0.27, 0.44], p < .001). Scores from each SST form also accounted for 14% of incremental variance in Frith-Happé feelings scores beyond what could be explained by general intelligence task performance (∆R2 = 0.14, F = 52.44, p < .001). SST scores also accounted for incremental variance in overall performance on the Frith-Happé categorization task beyond the effects of general intelligence (∆R2 = 0.12, F = 28.64, p < .001).
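The ∆R2 estimates above come from hierarchical regression: fit a baseline ordinary-least-squares model containing general intelligence, add the social intelligence score, and take the difference in R2. The following self-contained sketch illustrates that logic with toy data (it is not the study's analysis code, which would also report the F test for the increment):

```python
def ols_r2(y, xs):
    """R-squared from OLS with an intercept; `xs` is a list of predictor columns.
    Solves the normal equations by Gaussian elimination (fine for a few predictors)."""
    n = len(y)
    cols = [[1.0] * n] + [list(x) for x in xs]  # design matrix columns, intercept first
    k = len(cols)
    # Normal equations: (X'X) beta = X'y
    A = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    yhat = [sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    ybar = sum(y) / n
    ss_res = sum((y[t] - yhat[t]) ** 2 for t in range(n))
    ss_tot = sum((y[t] - ybar) ** 2 for t in range(n))
    return 1 - ss_res / ss_tot

def delta_r2(y, base_xs, added_x):
    """Incremental variance explained when adding one predictor to a base model."""
    return ols_r2(y, base_xs + [added_x]) - ols_r2(y, base_xs)
```

Here `base_xs` would hold the general intelligence scores and `added_x` the SST (or Frith-Happé) scores.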

Lastly, we examined whether scores on the SST and Frith-Happé tasks accounted for incremental variance in social knowledge after controlling for individual differences in general intelligence (Table 6). Both SST and Frith-Happé feelings scores were unique predictors of social norm knowledge (∆R2 = 0.20, F = 35.81, p < .001). Likewise, both tasks were also unique predictors of interpersonal skill as measured by the SJT (∆R2 = 0.10, F = 17.61, p < .001). Frith-Happé categorization task scores were not found to be a statistically significant predictor in either model. Although we only report the models when using SST Form A scores in Table 6, we observed the same pattern of results when using scores on Form B. These results provide further support for the validity of the shorter, 14-item SST forms as a correlate of individual differences in social norm understanding and knowledge of effective interpersonal behavior.

Table 6 Incremental Prediction of Social Judgment and Understanding of Social Norms

General Discussion

Our study is one of the first to explore how adults with ASD perform on a self-administered, online social intelligence (SI) test compared to adults without ASD (e.g., Livingstone et al., 2021). Regarding our first study aim, our data in Study 1 provide evidence for measurement invariance for the SST between adults with ASD and a large normative sample of 1,049 participants without ASD. In support of our second aim, we observed modest group mean SST score differences between adults with and without an ASD diagnosis (d = 0.21). We provide a histogram of SST scores for participants with and without ASD in Study 1 (Fig. 2). These results suggest that the SST holds promise as a valid, online, remote assessment of SI for adults in either clinical or subclinical populations. Moreover, unlike self- or observer-reported measures of autistic traits, which have been found to correlate with personality traits in non-clinical samples (Ingersoll et al., 2011; Schwartzman et al., 2016), the SST has been shown in past research to be practically unrelated to self-reported personality or trait emotional intelligence scores (Brown et al., 2019). We further explored our second aim in Study 2 by observing that both the SST and the Frith-Happé feelings task were unique predictors of understanding of social norms and knowledge of effective behavior in social situations, even after controlling for general intelligence scores. In addition, scores on the SST forms were positively related to performance on the Frith-Happé feelings task and demonstrated better internal consistency relative to the Frith-Happé categorization task. These findings suggest that the SST may be useful as a complement to many of the popular existing self- or observer-reported measures.

These findings are especially promising given the growing need for valid, online, self-administered assessments. Although past research has documented the development of web-based general cognitive ability or intelligence tests (e.g., Brown & Grossenbacher, 2017; Liu et al., 2020; Sliwinski et al., 2018; Wright, 2020), few performance-based SI tests besides the RMET have been used online without a proctor. Even though our participants completed the SST outside of a clinical setting and on their own devices (e.g., tablet, laptop, or desktop computer), we did not detect any degradation in measurement precision or item validity. Based on these results, the SST appears useful for assessing SI while allowing participants to complete the test remotely without having to travel to a clinic or research site. We also designed alternate, 14-item short forms of the SST which retain modest reliability and demonstrate validity evidence similar to what has been reported for the full 23-item version. These forms also displayed convergent validity with other ability measures of social intelligence and knowledge of socially acceptable behavior. The shorter forms may help researchers recruit larger samples or make participating in research studies more accessible to potential participants. Based on findings from recent research, the use of animation in the SST may also create a more engaging and enjoyable experience for participants relative to text-based assessments (Karakolidis et al., 2021).

Even though the SST was not explicitly designed to detect ASD or other developmental disorders or to quantify traits related to ASD, the test does appear to be somewhat sensitive to differences in SI between groups of participants with and without ASD. After controlling for demographic differences, we also found that adults without ASD scored higher on the SST than adults without ASD who are parents of children with ASD. These effect sizes were smaller than the difference between patients with schizophrenia and controls reported for the SAT-MC (d = 0.64; Pinkham et al., 2018) and for the Frith-Happé ToM task (d = 0.58; Wilson, 2021). However, the SST displayed several potential advantages compared to other existing animated shape tasks. The 14-item alternate SST forms displayed slightly stronger test-retest reliability compared to estimates reported for the SAT-MC in prior research (r = .55 for controls and r = .57 for patients in Pinkham et al., 2017). Both SST forms displayed better internal consistency relative to the Frith-Happé categorization task. The reliability estimates that we observed for the Frith-Happé categorization task were substantially worse than those reported by Livingstone et al. (2021) and suggest that continued research is needed to determine whether these items can adequately assess social intelligence when self-administered online. Similar weaknesses were recently documented by Andersen et al. (2022), who also reported weak reliability for the categorization task in a large sample of adolescents. We argue that these results indicate that the SST may provide a more reliable measure of social intelligence in studies involving adults with and without ASD. Still, further test development work may help improve the sensitivity of the SST and further optimize it for assessing ability differences within clinical populations.
We recommend that future researchers use the 14-item versions of the SST reported in Study 2, given that these forms provided good test-retest reliability along with internal consistency and convergent validity similar to what has been reported for the 23-item version. We provide the item order and text for both 14-item forms along with all of the video files on the Open Science Framework.

Implications and Directions for Future Research

We hope that our findings provide future researchers with the tools to further explore novel ways of assessing social intelligence or similar, more narrowly defined abilities. Researchers have long struggled to develop measures of social intelligence which are empirically distinct from general mental ability or intelligence (Lievens & Chan, 2010). Despite some recent attempts to explain how social intelligence fits within a broader framework of human abilities (e.g., Bryan & Mayer, 2021; MacCann et al., 2014), much of the research on social intelligence has been siloed within different subfields where construct labels and measurement methods are often inconsistent (Olderbak & Wilhelm, 2020). This makes it challenging for researchers to integrate findings across fields and to replicate results across different populations or research settings.

Another important avenue for future research is to determine the boundary conditions for administering the SST online. In our samples, adults with ASD were able to complete the SST outside of a controlled research or clinic setting. However, many of these adults appear to be relatively high functioning, based on their self-reported educational attainment. Future studies should seek to identify criteria which would help researchers determine whether a participant could be expected to provide valid responses in a self-administered, online assessment. Likewise, our samples only included participants who were 18 years of age or older even though animated shape tasks have been used to measure SI among children and adolescents (Altschuler et al., 2018; Burger-Caplan et al., 2016; Salter et al., 2008). Given that the SST items were designed to require as little reading as possible and thus be more independent of verbal ability or language skills, we expect that the test can be used in younger populations, but this has yet to be explored empirically. Thus, further research is needed to observe how this test functions when administered to younger participants in clinical or nonclinical populations.

Future research is also needed to observe the heritability and genetic predictors of SST scores. Twin studies have estimated genetic contributions to individual differences in social cognition (Isaksson et al., 2019) and measures of social functioning (Constantino & Todd, 2003). More specifically, a recent study reported a heritability estimate of 28% for performance on the RMET (Warrier et al., 2018). We would expect to find similar heritability estimates for the SST and similar animated shape tasks based on the convergent validity evidence with the RMET, but this has yet to be empirically observed. This work could potentially determine whether different measures of SI share common genetic influences and how distinct those influences may be from those which predict performance on more general intelligence or cognitive tests.

Study Limitations

There are some limitations to the results we report in this paper. We observed that the SST provided adequate, but not ideal, internal consistency for adults with or without ASD (0.60 < α < 0.80). Based on these results, the 23-item and 14-item forms of the SST are best suited for research purposes, where even modest reliability may be sufficient for detecting true effects (Schmitt, 1996). These forms are not reliable enough for high-stakes, diagnostic use, where Nunnally (1978) suggests a threshold of α ≥ 0.90. The modest reliability of the SST may have attenuated our observed mean differences between adults with or without ASD in Study 1. Another limitation is that we did not obtain a consistent measure of cognitive ability or intelligence for all participants in Study 1. Therefore, we were only able to control for coarse-grained educational attainment as a proxy for differences in cognitive functioning between adults with or without ASD. Our results in Study 2 indicate that performance on the SST is positively correlated with performance on a general intelligence task (Form A r = .47, p < .001; Form B r = .43, p < .001). However, we also found evidence that SST scores correlate with performance on other social intelligence tasks even after controlling for differences in general intelligence.
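Under classical test theory, measurement error inflates observed score variance, which shrinks a standardized mean difference by a factor of the square root of the reliability (d_obs ≈ d_true × √r_xx). As a purely hypothetical illustration of how much attenuation the SST's reliability could produce (not an analysis performed in this study):

```python
def disattenuated_d(d_obs, reliability):
    """Estimate the true-score standardized mean difference from an observed one,
    assuming classical test theory, where error variance inflates the within-group
    SD so that d_obs = d_true * sqrt(r_xx)."""
    return d_obs / reliability ** 0.5
```

For example, if reliability were α = 0.70 (within the 0.60–0.80 range reported above), the observed d = 0.21 from Study 1 would correspond to a true-score difference of roughly 0.25.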

Conclusion
Across two studies, we detected differences in social intelligence between adults with and without ASD using a remotely administered, freely available online test. Not only did we find support for measurement invariance between adults with or without an ASD diagnosis, but we also detected modest group mean differences where adults without ASD achieved higher SST scores compared to those with ASD. This effect was still present even after controlling for demographic differences between these two groups. We also designed two shorter, 14-item alternate forms of the test in Study 2. These forms provided good test-retest reliability and greater internal consistency compared to the Frith-Happé tasks. We also found that SST scores were related to knowledge of social norms and effective interpersonal behavior even after controlling for differences in general intelligence. These results indicate that the SST is a promising tool for measuring SI, especially in situations where in-person, on-site assessments are either impractical or not possible. Although future research is needed to further optimize the SST and boost its reliability for clinical purposes, this tool may help researchers obtain a quantitative measure of SI while avoiding some of the practical or psychometric limitations of other existing instruments.