TANGO: A reliable, open-source, browser-based task to assess individual differences in gaze understanding in 3 to 5-year-old children and adults

Traditional measures of social cognition used in developmental research often lack satisfactory psychometric properties and are not designed to capture variation between individuals. Here, we present the TANGO (Task for Assessing iNdividual differences in Gaze understanding-Open): a brief (approx. 5–10 min), reliable, open-source task to quantify individual differences in the understanding of gaze cues. Localizing the attentional focus of an agent is crucial for inferring their mental states, building common ground, and, thus, supporting cooperation. Our interactive browser-based task works across devices and enables in-person and remote testing. The implemented spatial layout allows for discrete and continuous measures of participants' click imprecision and is easily adaptable to different study requirements. Our task measures inter-individual differences in a child (N = 387) and an adult (N = 236) sample. Our two study versions and data collection modes yield comparable results that show substantial developmental gains: the older children are, the more accurately they locate the target. High internal consistency and test–retest reliability estimates underline that the captured variation is systematic. Associations with social-environmental factors and language skills speak to the validity of the task. This work shows a promising way forward in studying individual differences in social cognition and will help us explore the structure and development of our core social-cognitive processes in greater detail.

Supplemental Material for the manuscript 'TANGO: A reliable, open-source, browser-based task to assess individual differences in gaze understanding in 3 to 5-year-old children and adults'

Effects of trial type and trial number
Children showed nearly perfect precision in the first training trial. As visual access to the target location decreased in the succeeding training trials, imprecision levels increased.
Within test trials, children's imprecision levels did not vary as a function of trial number. Comparing the performance of children across our two data collection modes, we found that children participating remotely were slightly more precise. This difference was especially prominent in younger participants in the box version of the task. It is conceivable that caregivers were especially prone to influence the behavior of younger children. In the box version, caregivers might have had more opportunities to interfere since they carried out the clicking for their children.

In an exploratory analysis, we coded parental behavior and environmental factors during remote unsupervised testing. Due to the time-consuming nature of hand-coding videos frame by frame, we focused on the subsample with the greatest performance difference between data collection modes: the three-year-olds in the box version of the task (n = 0). We reasoned that if parental interference cannot explain the greatest performance difference in our sample, the effects would be negligible in the remaining sample. A trial was defined as the time between two eye-blinking sounds. We transcribed all utterances by parents and children and counted the words uttered by each. We then classified the utterances into several categories: questions asked by the child; repeated test questions by the caregiver; hints towards the agent (how many times the caregiver guided the child's attention to the agent); hints towards the eyes (how many times the caregiver guided the child's attention to the agent's eyes); verification of choice (how many times the caregiver questioned or double-checked the child's response); mentioning of the screen (how many times the caregiver verbally guided the child's attention to the screen); pointing to the screen (how many times the caregiver pointed towards the screen); positive and negative feedback; motivational statements; and incomprehensible utterances.
In addition, we coded how many adults and children were present, whether a response click was obviously conducted by the caregiver themselves, and whether children took a break during the trial. We conducted a model comparison to estimate the effects of parental interference. Our null model explained response behavior by age and target position, including by-subject random intercepts and random slopes for target position (model notation in R: correct ~ age + symmetricPosition + (1 + symmetricPosition | subjID)).
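For concreteness, a minimal sketch of how such a null model could be fit is given below. It assumes the brms package and a data frame named box_data containing the variables from the formula; both names and all fitting details are our assumptions, not the authors' actual code.

library(brms)

# Null model: correctness of the response explained by age and target
# position, with by-subject random intercepts and random slopes for
# target position. Data frame name and fitting defaults are assumptions.
m0 <- brm(
  correct ~ age + symmetricPosition + (1 + symmetricPosition | subjID),
  data   = box_data,
  family = bernoulli()
)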
We compared this null model to models additionally including the number of words uttered by the caregiver, the number of repeated test questions, verification of choice, or hints towards the eyes as fixed effects. Furthermore, we calculated a parental interference index by summing the number of repeated test questions, verification of choice, and hints towards the eyes, with the sign matching each variable's direction of effect. The remaining coded variables were not included since there was not enough variation and/or too few occurrences in our sample. We compared models using WAIC (widely applicable information criterion) scores and weights.
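As an illustration, such a comparison could be computed as in the following sketch, again assuming brms; m0 is the null model from above, and m_verify is a hypothetical alternative model adding verification of choice (the variable name is our assumption):

# Alternative model adding one parental-behavior predictor
m_verify <- brm(
  correct ~ age + symmetricPosition + verificationOfChoice +
    (1 + symmetricPosition | subjID),
  data = box_data, family = bernoulli()
)

waic(m0, m_verify)                             # lower WAIC = better expected fit
model_weights(m0, m_verify, weights = "waic")  # probability of best out-of-sample prediction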
As an indicator of out-of-sample predictive accuracy, lower WAIC scores stand for a better model fit. WAIC weights represent the probability that the model in question provides the best out-of-sample prediction compared to the other models. On the trial level, the model including verification of choice as a main effect performed best: here, the less caregivers asked for children's responses again, the more likely children were to click on the correct box. Interestingly, the effect reversed on the subject level, possibly due to greater learning effects for the children who were most likely to click incorrectly in the beginning and consequently received the most parental comments. On the subject level, the model including the number of repeated test questions performed best: the more caregivers asked again where the target landed, the more likely children were to respond with the incorrect box. In all cases, however, ELPD difference scores were smaller than their standard errors. Similarly, the 95% CIs of the model estimates included zero and were rather wide. Therefore, we conclude that the effect of parental interference was negligible and could, most likely, be explained as described above.

We quantified children's social environment in three ways. First, we calculated a sibling variety score according to Peterson (2000). Second, we implemented the modified version of Cassidy, Fineberg, Brown, and Perkins (2005). Third, based on our own data exploration, we calculated the amount of peer exposure, determined as the number of siblings and the average hours spent in childcare. We compared the models using WAIC (widely applicable information criterion) scores and weights (McElreath, 2020), where lower WAIC scores indicate better out-of-sample predictive accuracy and WAIC weights represent the probability that the model in question provides the best out-of-sample prediction compared to the other models.

Materials

We employed the oREV, an Item Response Theory based open receptive vocabulary task for 3- to 8-year-old children (Bohn et al., 2022). Similarly to the TANGO, the task was presented as an interactive web application (see Figure 5; live demo: https://ccp-odc.eva.mpg.de/orev-demo/; source code: https://github.com/ccp-eva/orev-demo).
Each trial presented four pictures: one target word alongside three distractors (one phonological, one semantic, one unrelated). A verbal prompt asked children to select the picture corresponding to the target word.

We recruited participants using the online participant recruitment service Prolific, which originated from the University of Oxford. Prolific's subject pool consists of a mostly European and US-American sample, although subjects from all over the world are included. The recruitment platform realises ethical payment of participants, requiring researchers to pay participants a fixed minimum wage of £5.00 (around US$6.50 or €6.00) per hour. We decided to pay all participants the same fixed fee, set in relation to the estimated average time taken to complete the task. Prolific distributed our study link to potential participants, while the online study itself was hosted on local servers at the Max Planck Institute for Evolutionary Anthropology, Leipzig. Therefore, study data were saved only on our internal servers, while Prolific provided demographic information about the participants. Participants' Prolific IDs were forwarded to our study website using URL parameters. This way, we could match participant demographic data to our study data.
The same technique was used to confirm study completion: we redirected participants from our study website back to the Prolific website using URL parameters. We used Prolific's built-in prescreening filter to include only participants who were fluent in English and could therefore properly understand our written and oral study instructions.
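A minimal R sketch of the ID-based matching step described above; the file and column names are hypothetical, not the authors' actual ones:

# Match Prolific's demographic export to the study data via the
# Prolific ID that was forwarded through URL parameters.
prolific <- read.csv("prolific_export.csv")   # hypothetical file names
study    <- read.csv("tango_study_data.csv")
matched  <- merge(study, prolific,
                  by.x = "prolificID", by.y = "participant_id")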

Study 1 - Validation hedge version
The aim of Study 1 was to validate the hedge version of our gaze understanding task.
The pre-registration can be found here: https://osf.io/r3bhn. We recruited participants online by advertising the study on Prolific.
Fifty adults participated in the study. One additional subject returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion.

Data collection took place in May 2021. Participants were compensated with £1.25 for completing the study. We estimated an average completion time of 6 minutes, resulting in an estimated hourly rate of £10.00. On average, participants took 05:56 min to complete the study. Participants were required to complete the study on a tablet or desktop computer. Participation on mobile devices was disabled since the display would be too small and would harm click precision. It was indicated that the study required audio sound.
We stored Prolific's internal demographic information and did not ask for additional personal information.

Study 2 - Validation box version
As in Study 1, we recruited participants on Prolific and employed the same methodology. However, this time we focused on validating the box version of the task in an adult sample. Participants were presented with eight boxes in which the target could land. Fifty adults participated in the study. One additional subject returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion. Data collection took place in June 2021. Participants were compensated with £1.00 for completing the study. We estimated an average completion time of 6 minutes, resulting in an estimated hourly rate of £10.00. On average, participants took 04:43 min to complete the study.

Study 3 - Reliability hedge version
In Studies 3 and 4, we assessed the test-retest reliability of our gaze understanding task in an adult sample. The pre-registration can be found here: https://osf.io/nu62m. We tested the same participants twice with a delay of two weeks. The testing conditions were as specified in Studies 1 and 2. However, the target locations as well as the succession of animals and target colors were randomized once. Each participant then received the same fixed randomized order of target locations, animals, and target colors. Participants received 30 test trials without voice-over description, so that each of the ten bins occurred exactly three times.
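The fixed pseudo-random trial order amounts to a computation like the following R sketch (the seed and object names are hypothetical):

set.seed(42)                        # hypothetical seed; drawn once, then fixed for everyone
bin_order <- sample(rep(1:10, 3))   # 30 trials; each of the ten bins exactly three times
# Box version (Study 4) analogously: sample(rep(1:8, 4)) for 32 trials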
In addition to the aforementioned prescreening settings, we used an allowlist. Prolific has a so-called custom allowlist prescreening filter in which one can enter the Prolific IDs of participants who completed a previous study. Only these subjects are then invited to participate in a study. This way, repeated measurements can be implemented, collecting data from the same subjects at different points in time.
In a first round, 60 participants took part on the first test day. Two additional subjects returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion. One additional participant timed out, i.e., did not finish the survey within the allowed maximum time. The maximum time is calculated by Prolific based on the estimated average completion time; for this study, it amounted to 41 minutes. For the first test day, participants were compensated with £1.25. We estimated an average completion time of 9 minutes, resulting in an estimated hourly rate of £8.33. On average, participants took 07:11 min to complete the first part.
Of the 60 participants who completed test day 1, 41 subjects finished test day 2. One additional participant timed out, i.e., did not finish the survey within the allowed maximum time. Participants were compensated with £1.50 for completing the second part of the study. We estimated an average completion time of 9 minutes, resulting in an estimated hourly rate of £10.00. On average, participants took 06:36 min to complete the second part of the study.
Since we aimed for a minimum sample size of 60 subjects participating on both test days, we reran the first test day with an additional 50 participants. Seven additional subjects returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion. Two additional participants timed out, i.e., did not finish the survey within the allowed maximum time. Again, participants were compensated with £1.25 for completing the first part of the study (estimated average completion time 9 minutes, estimated hourly rate of £8.33). On average, participants took 06:51 min to complete the first part.
Of the additional 50 participants who completed test day 1, 29 subjects finished test day 2. Again, participants were compensated with £1.50 for completing the second part of the study (estimated average completion time 9 minutes, estimated hourly rate of £10.00). On average, participants took 06:26 min to complete the second part of the study.

Study 4 - Reliability box version
As in Study 3, we recruited participants on Prolific and employed the same methodology. However, this time participants were presented with the box version of the task. As in Study 2, we employed eight boxes in which the target could land. Participants received 32 test trials without voice-over description, so that each of the eight boxes occurred exactly four times.
In a first round, 60 participants took part on the first test day. Five additional subjects returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion. For the first test day, participants were compensated with £1.25. We estimated an average completion time of 9 minutes, resulting in an estimated hourly rate of £8.33. On average, participants took 07:33 min to complete the first part.
Participants were compensated with £1.50 for completing the second part of the study. We estimated an average completion time of 9 minutes, resulting in an estimated hourly rate of £10.00. On average, participants took 07:50 min to complete the second part of the study.
Since we aimed for a minimum sample size of 60 subjects participating on both test days, we reran the first test day with an additional 50 participants. Eight additional subjects returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion. One additional participant timed out, i.e., did not finish the survey within the allowed maximum time. Again, participants were compensated with £1.25 for completing the first part of the study (estimated average completion time 9 minutes, estimated hourly rate of £8.33). On average, participants took 07:37 min to complete the first part.
Of the additional 50 participants who completed test day 1, 28 subjects finished test day 2. Three additional subjects returned their submission, i.e., decided to leave the study early or withdrew their submission after study completion. One additional participant timed out, i.e., did not finish the survey within the allowed maximum time.
Again, participants were compensated with £1.50 for completing the second part of the study (estimated average completion time 9 minutes, estimated hourly rate of £10.00). On average, participants took 06:30 min to complete the second part of the study.

Instructions and voice-over descriptions
This is the content of our audio recordings that were played as instructions and during voice-over trials.

Figure 1. Imprecision by trial type, split by study version and sample. The x axis represents the trial type. The y axis represents imprecision, i.e., the absolute distance between the target's center and the participant's click. The unit of imprecision is the width of the target, i.e., a participant with an imprecision of 1 clicked one target width to the left or right of the true target center. Small dots show the imprecision for each subject in each trial. Boxplots (boxes represent the first to third quartiles of the data; vertical lines indicate the median; horizontal black lines display the range) and a half violin plot show the data distribution.
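In code, the imprecision measure amounts to the following one-line sketch (assuming the horizontal click-to-center distance; all variable names are hypothetical):

# Absolute horizontal distance between click and target center,
# expressed in units of the target's width
imprecision <- abs(click_x - target_center_x) / target_width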

Figure 2. Imprecision across test trials, split by study version and sample. The x axis represents trial number. The y axis represents imprecision, i.e., the absolute distance between the target's center and the participant's click. The unit of imprecision is the width of the target, i.e., a participant with an imprecision of 1 clicked one target width to the left or right of the true target center. The black dashed regression lines show smooth conditional means based on linear models. Small colored dots show the imprecision for each subject in each trial. Colored lines connect the trials of each individual.

Figure 4. Reliability split by age group. (A) Internal consistency (odd-even split) in the hedge child sample by age group. (B) Test-retest reliability in the hedge child sample by age group. (C) Internal consistency (odd-even split) in the box child sample by age group. (D) Test-retest reliability in the box child sample by age group. For the hedge version, performance is measured as imprecision, i.e., the absolute distance between the target's center and the participant's click (averaged across trials). The unit of imprecision is the width of the target, i.e., a participant with an imprecision of 1 clicked on average one target width to the left or right of the true target center. For the box version, performance is measured as the proportion of correct responses, i.e., how many times the participant clicked on the box that contained the target. Regression lines with 95% CI show smooth conditional means based on a linear model (generalized linear model for the box version), with Pearson's correlation coefficient r. Dots show the performance for each subject. The color of data points denotes age group.
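For illustration, the odd-even split in panels (A) and (C) corresponds to a computation like the following sketch (a subjects-by-trials layout and the object name perf are our assumptions):

# Subject-wise mean performance on odd vs. even trials, correlated
# across subjects; perf is a hypothetical subjects x trials matrix
odd  <- rowMeans(perf[, seq(1, ncol(perf), by = 2)])
even <- rowMeans(perf[, seq(2, ncol(perf), by = 2)])
r_oe <- cor(odd, even)   # Pearson's r as reported in the figure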
Figure 5. Setup of the oREV. On each trial, participants heard a word and were asked to select the corresponding picture. Verbal prompts could be replayed by pressing the loudspeaker button.
Results

The greater the number of people and, more specifically, children in a child's environment, and the more diverse their ages, the more likely children were to understand the agent's gaze cue. The only predictor resulting in a negative estimate was the age at which a participant entered childcare, i.e., the later a child entered childcare, the better their performance in the task. Note that we did not find a great difference in WAIC scores between the compared models (see Supplements for WAIC scores and weights). The model estimates were all considerably smaller than the estimates of age, study version, and data collection mode, and all 95% CIs included zero. Effect sizes were probably also influenced by the lack of variance in the predictors: variables like household size and number of siblings typically vary very little among German households (see distribution characteristics of predictor variables below). Nevertheless, a general pattern emerges: exposure to a more variable social environment positively influenced children's gaze understanding.