For most of us, the Internet is part of everyday life. Over half of the world’s population (51%) now uses the Internet, and this proportion is even higher for young people (69%), especially those living in developed countries (98%; International Telecommunication Union, 2020). The COVID-19 pandemic increased the amount of time people spend on the Internet while restricting in-person contact, making online testing an attractive option for psychological research. Even before the pandemic, online methods were increasingly used as an alternative to in-person research conducted in the laboratory (e.g., Chetverikov & Upravitelev, 2015; Houben & Wiers, 2008; Milne et al., 2020; Smith & Leigh, 1997; Taherbhai et al., 2012), while the emergence of a number of online platforms provided new tools for recruitment and testing (e.g., Gosling & Mason, 2015; Grootswagers, 2020).

Although there are legitimate concerns about online testing, such as lack of control over characteristics of the samples and testing contexts (e.g., Birnbaum, 2004; Krantz & Dalal, 2000), online studies have several features that make them equivalent or even superior to in-person testing (e.g., Casler et al., 2013; Dandurand et al., 2008; Gosling et al., 2004). First, data quality can be comparable, in that online and in-person studies tend to yield similar findings. Second, Internet samples can be more diverse and representative of the general population in terms of age, gender, and socioeconomic status, particularly when compared to samples composed solely of college students registered in introductory psychology courses. Third, access to relatively rare target audiences, such as musicians, tends to be easier. Fourth, participants may feel more comfortable and act more naturally at home than when they come to a laboratory. Fifth, building an online experiment, recruiting participants, and collecting data can be more efficient in terms of time and costs, especially when responses are scored and recorded automatically on the hosting platform. Finally, online experiments are not limited to the space and time constraints of a laboratory.

Despite these benefits, online testing needs specific exclusion criteria, careful experimental designs that maximize control (e.g., Gosling et al., 2004), and appropriate motivational strategies (e.g., promising feedback at the end) to improve the likelihood that participants complete the whole experiment. Auditory research, and temporally based experimental tasks in general, can be particularly challenging, because compared to the laboratory, online testing occurs in contexts that are more variable and uncontrolled in terms of extraneous sounds, technical aspects of stimulus presentation, and potential interruptions (e.g., Milne et al., 2020). Although this variability can be reduced by asking participants to follow specific instructions (e.g., to wear headphones), experimental control remains limited.

How similar are the findings from in-person and online experiments? Positive results come from an online study about reinforcement learning (Nussenbaum et al., 2020), which replicated a main effect of age that was reported in an earlier in-person study (Decker et al., 2016). In other developmental research, online data replicated a mediating role for abstract reasoning ability in the link between age and model-based learning (Chierchia et al., 2019). In non-developmental research, Houben and Wiers (2008) found that an implicit association test was effective at identifying alcohol-related associations whether it was administered online or in person.

Although there is substantial evidence that simple tasks can be reliably adapted for online testing, an open question is whether longer and more cognitively demanding tasks can be similarly adapted. In one instance, Dandurand et al. (2008) adapted a complex problem-solving task (from Dandurand et al., 2004) for online testing. Across platforms, participants’ performance was better when they observed or read instructions on how to solve the problem successfully, compared to when they were simply given feedback on their decisions. Nevertheless, online participants were less accurate in general than in-person participants, even though the testing format did not influence the main effect of the learning manipulation (i.e., no interaction).

In the present investigation, we used the platform Gorilla (Anwyl-Irvine et al., 2020) to create an online version of an objective measure of musical ability: the Musical Ear Test (MET). The MET is a listening test that has documented reliability and validity (Swaminathan et al., 2021; Wallentin et al., 2010a, 2010b). It is designed in the tradition of musical aptitude (i.e., natural musical ability) tests, with two subtests, Melody and Rhythm, both of which require participants to determine, on multiple trials, whether two auditory sequences (a standard followed by a comparison) are identical. Musical aptitude tests, dating back to the early twentieth century (Bentley, 1966; Gordon, 1965; Seashore, 1919; Seashore et al., 1960; Wing, 1962), were designed to identify whether musically untrained individuals (primarily children) are likely to benefit from music lessons, based on the view that people with little natural ability would be unlikely to benefit in this regard. These older tests, as well as more recent tests of musical ability (Asztalos & Csapó, 2014; Fujii & Schlaug, 2013; Law & Zentner, 2012; Peretz et al., 2003, 2013; Ullén et al., 2014; Zentner & Strauss, 2017), all require same–different comparisons of two auditory events that differ in pitch (e.g., melody) or time (e.g., rhythm), or along other dimensions such as timbre and amplitude. In other words, the tests rely on core musical skills, specifically auditory short-term (working) memory and perceptual discrimination. As a broad phenotype, musical ability incorporates many other aspects of behavior (e.g., expert levels of performance, long-term memory for melodies) that are dependent on learning and practice. The goal of tests such as the MET is to measure musical ability in the absence of any formal training, and to do so objectively and quickly.

We also used Gorilla to run the entire testing session, which included measures of general cognitive ability and personality, and to create an online version of a self-report measure of musical behavior and expertise: the Goldsmiths Musical Sophistication Index (Gold-MSI; Lima et al., 2020; Müllensiefen et al., 2014). The Gold-MSI served as our principal measure of construct validity. Virtually all developers of tests of musical ability report positive correlations with musical expertise as a means of documenting a test’s validity (Asztalos & Csapó, 2014; Law & Zentner, 2012; Ullén et al., 2014; Wallentin et al., 2010a; Zentner & Strauss, 2017).

We compared response patterns from our online sample with previous studies that had large samples of participants: Swaminathan et al. (2021, N = 523) for the MET, and Lima et al. (2020, N = 408) for the Gold-MSI. Specifically, we compared the present sample with these comparison samples in terms of their psychometric characteristics, including internal reliability, construct validity, correlations between subtests, and correlations between musical ability and musical sophistication. We also tested for associations with demographic variables, cognitive ability, and personality, because previous studies have shown robust associations with these variables (e.g., Cooper, 2019; Greenberg et al., 2015; Kuckelkorn et al., 2021; Lima et al., 2020; Moreno et al., 2011; Swaminathan et al., 2021). Absolute levels of performance on our measures could vary across samples depending on the degree to which they differ in music training, age, cognitive ability, personality, education, and so on. In terms of age and education, Lima et al. tested Portuguese individuals from the general population who varied widely, whereas Swaminathan et al. tested Canadian undergraduates who varied minimally.

Because the Gold-MSI has a history of online and in-person testing (Correia et al., 2020; Greenberg et al., 2015; Lima et al., 2020; Müllensiefen et al., 2014; Schaal et al., 2015), we predicted that results from our online version of the test would be similar to those from the paper-and-pencil administration of Lima et al. (2020), with similar psychometric properties. We were less certain of the outcome with the online version of the MET, primarily because technological requirements were much greater for an objective listening test, which required participants to determine, on each of 104 trials, whether two auditory sequences were identical.

In short, our main objective was to determine whether the MET could be successfully administered online. Evidence of success required that the test’s internal reliability would not be compromised by online administration, that performance would correlate positively with musical expertise, and that musical ability would have positive associations with general cognitive ability. Moreover, musical expertise should be a better predictor of scores for the Melody subtest of the MET than for the Rhythm subtest, as is the case with in-person testing (Swaminathan et al., 2021). Other findings from previous research (Swaminathan & Schellenberg, 2018; Butkovic et al., 2015) indicated that the online test’s success would be further supported by a positive correlation with scores on one (and only one) dimension from the Big Five model of personality (McCrae & Costa, 1987; McCrae & John, 1992): openness to experience.

More novel aspects of the present study included our prediction that mind-wandering would be associated negatively with performance on the MET, because the MET required participants to concentrate for 18 min. One might also expect lower levels of mind-wandering among individuals who have taken music lessons for a longer period of time, because learning to play music requires much time, effort, and focus. Our use of the Gold-MSI as a measure of musical expertise allowed us to explore whether aspects of musical expertise other than training were predictive of performance, and whether their predictive power would vary across subtests. Previous studies of musical ability restricted tests of construct validity to associations with musicianship status, amount of daily practice, duration of music training, or involvement in professional music-related activities (Law & Zentner, 2012; Swaminathan et al., 2021; Ullén et al., 2014; Wallentin et al., 2010a). The Gold-MSI allowed us to examine whether musical ability would also be associated with active engagement with music, emotional responding to music, and self-reports of singing and perceptual abilities. Such associations would confirm that the narrow range of abilities tested by the MET is predictive of a much broader range of musical abilities.



Participants

A total of 754 participants were tested originally. We subsequently excluded participants who did not complete the MET (n = 100) or failed to respond on several trials on either the Melody or the Rhythm subtest, which we defined as more than 10 trials in total (n = 39) or more than 5 in a row (n = 7). The final sample included 608 participants (361 female, 243 male, 4 unreported) between 18 and 88 years of age (M = 34.2, SD = 15.1). Most had completed high school (n = 207) or had a university degree (bachelor’s, n = 108, master’s, n = 191, Ph.D., n = 58). Only three participants had less than 10 years of education. Education data were missing for 41 participants.
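The missing-response exclusion rules can be sketched in code. The following is a minimal illustration, not the authors' actual pipeline: it assumes each subtest's responses are stored as a list in which `None` marks a trial with no response, and that both rules are applied to each subtest separately (the text does not say whether the "in total" count is per subtest or across both). The function and constant names are hypothetical.

```python
from typing import Optional, Sequence

# Thresholds taken from the text; names are hypothetical.
MAX_MISSING_TOTAL = 10  # excluded if MORE than 10 trials are missing
MAX_MISSING_RUN = 5     # excluded if MORE than 5 consecutive trials are missing


def longest_missing_run(responses: Sequence[Optional[bool]]) -> int:
    """Length of the longest streak of consecutive missing (None) responses."""
    longest = current = 0
    for response in responses:
        current = current + 1 if response is None else 0
        longest = max(longest, current)
    return longest


def should_exclude(responses: Sequence[Optional[bool]]) -> bool:
    """Apply both missing-response rules to one subtest (assumed per-subtest)."""
    n_missing = sum(response is None for response in responses)
    return (n_missing > MAX_MISSING_TOTAL
            or longest_missing_run(responses) > MAX_MISSING_RUN)


# Arithmetic check on the reported sample sizes: 754 tested, 146 excluded.
final_n = 754 - 100 - 39 - 7
```

With these thresholds, six missing responses in a row trigger exclusion, whereas five do not, and the reported exclusions leave the final sample of 608.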

Participants were recruited primarily through snowball sampling and social media posts, which read: Do you like music? Do you know anyone who does? We are running an online study on personality and musical abilities. We are looking for listeners with all kinds of musical backgrounds. A subsample of undergraduate students was recruited via email and received partial course credit for their participation. The experiment was available in four languages, and participants were instructed to complete it in their native language (Italian, n = 288; European Portuguese, n = 153; Brazilian Portuguese, n = 123; English, n = 44). Informed consent was collected from all participants, and ethical approval for the study protocol was obtained from the local ethics committee at ISCTE-IUL (reference 07/2021).

Participants varied widely in terms of music training. Nearly half had no history of music lessons (n = 151) or a maximum of 2 years (n = 133), but 156 had 10 years or more. Training included private lessons (n = 123), classes taught at university (n = 122), or classes in music academies or conservatories (n = 84). Others (n = 85) were self-taught. On average, participants with music lessons started their training at the age of 11.4 years (SD = 7.1; range: 2–56). The relatively high proportion of participants with extensive backgrounds in music was presumed to stem from their personal interest in the study.


All tasks and questionnaires, created originally in English, were adapted for online testing using Gorilla Experiment Builder (Anwyl-Irvine et al., 2020). Validated translations of the measures (e.g., the Big Five Inventory in European-Portuguese and Italian) were used when available. When a task or questionnaire was not available for our target languages, instructions and items were translated by bilinguals who were native speakers and also fluent in English.

Online versions of the MET and the Gold-MSI are available on Gorilla for other researchers to use.

Objective behavioral tests

Musical ability

An online version of the Musical Ear Test (MET; Wallentin et al., 2010a) was used to evaluate music perception abilities. We attempted to make the online experience as similar as possible to in-person testing, when the test is installed on a personal computer in the laboratory, and participants listen to stimuli over headphones and record their responses on an answer sheet. As in the original version, the online MET had two subtests, Melody and Rhythm (in that order), each of which had 52 trials. On each trial, participants listened to two short musical excerpts (a standard followed by a comparison) and made a yes/no judgment about whether the comparison was the same as the standard. On both subtests, half of the trials were same and half were different. The stimuli and order of presentation were the same as in the original test. All musical excerpts had the same metrical structure (4/4 time) and tempo (100 beats per minute). A lower-amplitude metronome sound indicated the underlying beat. Each subtest was preceded by two practice trials (one same, one different). Feedback was provided for practice trials but not for test trials. Detailed descriptions of MET stimuli are provided in Swaminathan et al. (2021).

In the original test, all instructions and trials are presented via an 18-min digital audio file, with task instructions and the number of each trial provided by a male speaker. Trials are not self-paced. Rather, participants are given a brief window after each trial (1500 ms for melodic trials, 1659 to 3230 ms for rhythmic trials) to respond by checking yes or no on a response sheet. In our online adaptation of the MET, instructions and trial numbers were converted to text that participants read. The actual stimuli from each trial were digitally copied from the original audio file and the duration of the inter-stimulus intervals was preserved, such that the total duration (approximately 18 min) of the MET was identical to the in-person version. The trial number and the question (e.g., Are the melodic phrases identical?) were visible on the screen from the beginning of each trial until the participant responded. Immediately after the audio stimulus ended, two buttons (labeled Yes and No) appeared, and participants had a few moments to respond by clicking the appropriate button. Examples of MET stimuli are illustrated in musical notation in Fig. 1.

Fig. 1 Example trials from the MET Melody and Rhythm subtests. Reprinted by permission from Springer, Behavior Research Methods, “The Musical Ear Test: Norms and correlates from large sample of Canadian undergraduates,” Swaminathan, Kragness, & Schellenberg (2021), advance online publication, 11 March 2021, doi: 10.3758/s13428-020-01528-8

To enhance the online testing experience, we provided a progress bar at the bottom of the screen throughout both subtests, such that participants could monitor where they were in relation to the beginning and end of the subtest. We also provided feedback at the end of the test about the participant’s performance, which was calculated as the total number of correct responses on the Melody and Rhythm subtests. For statistical analyses, a Total score was also calculated as the sum.
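The scoring described above amounts to counting correct responses per subtest and summing them. A minimal sketch follows, assuming each trial has already been coded as correct (`True`) or not (`False`); how unanswered trials were coded is not stated in the text, so that step is left out. The function name and return format are hypothetical.

```python
from typing import Dict, Sequence

TRIALS_PER_SUBTEST = 52  # each MET subtest has 52 trials


def met_scores(melody_correct: Sequence[bool],
               rhythm_correct: Sequence[bool]) -> Dict[str, int]:
    """Number of correct responses per subtest, plus their sum (Total)."""
    for name, flags in (("Melody", melody_correct), ("Rhythm", rhythm_correct)):
        if len(flags) != TRIALS_PER_SUBTEST:
            raise ValueError(f"{name} subtest should have {TRIALS_PER_SUBTEST} trials")
    melody = sum(melody_correct)
    rhythm = sum(rhythm_correct)
    return {"melody": melody, "rhythm": rhythm, "total": melody + rhythm}
```

For example, a participant with 40 correct Melody trials and 35 correct Rhythm trials would receive a Total score of 75 out of 104.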

General cognitive ability

Our measure of general cognitive ability (hereafter cognitive ability) was the Matrix Reasoning Item Bank (MaRs-IB; Chierchia et al., 2019), an online test of abstract (nonverbal) reasoning modeled after Raven’s Advanced Progressive Matrices (Raven, 1965). On each of 80 trials, a 3 × 3 matrix was presented on the computer screen. Eight of nine cells contained abstract shapes, but the ninth (bottom-right) cell was always empty. Participants’ task was to complete the matrix by choosing one of four alternatives. Two examples are provided in Fig. 2. Associations among shapes could vary on a single dimension for the simplest trials (e.g., color), but on up to four dimensions (e.g., color, size, shape, and location) for more difficult trials.

Fig. 2 Two example trials from the Matrix Reasoning Item Bank (MaRs-IB). The third and fourth options are the correct responses for the upper and lower examples, respectively

On each trial, before the matrix was presented, a 500-ms fixation cross appeared in the middle of the screen, followed by a 100-ms white screen. Participants then had up to 30 s to look at the matrix and select a response. The trial ended earlier if participants responded. If no response was provided after 25 s, a clock appeared and indicated the time remaining.

The order of the trials was the same for all participants. The first five items were relatively easy so as to familiarize participants with the task. Although the duration of the entire task was fixed at 8 min, participants were not informed of the task duration or the number of trials, only that they had up to 30 s to complete each trial. If they completed the 80 trials in less than 8 min, the trials were presented again in the same order, but responses from the second round were not considered in calculating scores. Scores were calculated as the proportion of correct responses out of the total number of responses given by the participant. For the statistical analyses, proportions were logit-transformed.
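The MaRs-IB scoring rule can be illustrated with a short sketch. It assumes responses from the first pass through the items are available as correct/incorrect flags, and it uses the standard logit, log(p / (1 − p)). The text does not say how proportions of exactly 0 or 1 were handled (the logit is undefined there), so this sketch simply raises an error in that case. Function names are hypothetical.

```python
import math
from typing import Sequence


def mars_ib_proportion(first_round_correct: Sequence[bool]) -> float:
    """Proportion correct among the responses the participant actually gave."""
    if not first_round_correct:
        raise ValueError("participant gave no responses")
    return sum(first_round_correct) / len(first_round_correct)


def logit(p: float) -> float:
    """Standard logit transform; undefined at 0 and 1 (handling not in source)."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined for proportions of 0 or 1")
    return math.log(p / (1.0 - p))
```

A participant who answered 40 items and got 30 right would have a proportion of .75, which maps to a logit of ln(3), roughly 1.10.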


Self-report questionnaires

Musical expertise

Our principal measure for tests of construct validity was the Gold-MSI (Müllensiefen et al., 2014), a self-report questionnaire of musical expertise and behavior. The Gold-MSI has 38 items that evaluate different behaviors related to music (e.g., I spend a lot of my free time doing music-related activities). Although the items are mixed in terms of order of presentation, for scoring purposes they are grouped to form five subtests: Active Engagement (9 items), Perceptual Abilities (9 items), Music Training (7 items), Singing Abilities (7 items), and Emotions (6 items). A General Musical Sophistication factor is also calculated from 18 items that are representative of the five subtests. For the first 31 items, participants judge how much they agree with each statement on a seven-point rating scale (1 = completely disagree, 7 = completely agree). For the final seven items, participants select one of seven alternatives from an ordinal scale that varies from item to item. For example, the scale for the statement I listen attentively to music for … had options ranging from 1 (0 - 15 min per day) to 7 (4 hours or more per day).

For European-Portuguese participants, we created an online version of a published translation of the Gold-MSI that has good psychometric properties (Lima et al., 2020). For the Italian translation, items from the original English version were translated to Italian independently by two translators, both of whom were native speakers of Italian, fluent in English, experienced in translating questionnaires, and experts in the psychology of music. The goal was conceptual equivalence rather than a literal translation. Discrepancies between translations were resolved by discussion to create a single version, which was, in turn, evaluated by two independent colleagues for clarity of expression and whether the translation from English was appropriate. The Italian version was then back-translated by a native speaker of English who was fluent in Italian and a scholar of psychology and music. Inconsistencies between the back-translation and the original Gold-MSI were discussed and resolved among the three translators, who also consulted with two additional experts from the discipline. Finally, 10 participants completed the Italian translation of the Gold-MSI and confirmed that the items were clear.

For the Brazilian-Portuguese version, a native speaker, who was also fluent in English and an expert in the psychology of music, made minor modifications to the European-Portuguese version. To ensure that each modification was consistent with the original Gold-MSI, she first checked the English version. Such modifications included the progressive tense (I am hearing translated to estou ouvindo instead of estou a ouvir), the second-person pronoun (replacing tu with você), some Brazilian-Portuguese idioms, and minor changes in spelling.

Cronbach’s alphas for the entire sample and for the previously unpublished (Italian and Brazilian-Portuguese) translations of the Gold-MSI are provided in Supplementary Table 1. In general, internal reliability was similar to the comparison sample (Lima et al., 2020), except for a lower alpha in the present sample for the Emotions subtest. Internal reliability was maintained for the previously unpublished translations.
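For readers unfamiliar with the statistic, Cronbach's alpha can be computed from item-level data using the standard formula, α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The sketch below is a generic illustration with a hypothetical function name and data layout (one row per participant, one column per item); it is not the authors' analysis code, and it assumes any reverse-keyed items have already been recoded.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(item_scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a participants-by-items table of scores."""
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two participants")
    columns = list(zip(*item_scores))  # one tuple of scores per item
    sum_item_variances = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum_item_variances / total_variance)
```

When all items rank participants identically, alpha reaches 1; as items become less consistent with one another, the summed item variances approach the variance of the totals and alpha falls.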


Personality

Personality traits were evaluated with the Big Five Inventory (BFI). The BFI is a self-report questionnaire with 44 items that assess five dimensions of personality: openness to experience (10 items), conscientiousness (9 items), extroversion (8 items), agreeableness (9 items), and neuroticism (8 items). Items are mixed in terms of presentation order. Participants rated how much each expression describes them using a five-point rating scale (1 = disagree strongly, 5 = agree strongly).

The BFI was published initially in English (John & Srivastava, 1999), and subsequently translated into European-Portuguese (Brito-Costa et al., 2015) and Italian (Ubbiali et al., 2013). We created a Brazilian-Portuguese version by modifying the European-Portuguese version, double-checking the original English version for fidelity. Cronbach’s alphas for the BFI were acceptable and are provided in Supplementary Table 2.


Mind-wandering

As a measure of sustained attention and ability to focus, participants completed the Mind-Wandering Questionnaire (MWQ; Mrazek et al., 2013), a five-item scale with good psychometric properties that evaluates trait levels of mind-wandering (e.g., I have difficulty maintaining focus on simple or repetitive work). Participants rated how much they agreed with each sentence on a scale that ranged from 1 (almost never) to 6 (almost always). Cronbach’s alphas for the MWQ were good and are provided in Supplementary Table 2.


Procedure

Participants completed all tasks and questionnaires in one testing session. Access to the experiment was initially provided with a hyperlink posted on social media (e.g., Facebook, Twitter, LinkedIn), which was accompanied by a brief description of the study, including its duration of approximately 40 min. The description also specified that participants should complete the testing session in a quiet room with a stable Internet connection, use headphones, and turn off sound notifications from other devices and applications (e.g., email, phone messages).

The online testing session began with informed consent and some basic demographic questions (e.g., age, gender, education). Participants then completed the self-report questionnaires, which were administered in a fixed order (MWQ, Gold-MSI, and BFI). After the questionnaires, participants were tested on the MaRs-IB and finally the MET. At the end of the study, participants were given feedback about their scores on the personality, musical sophistication, and musical ability measures. A final open-ended question asked participants to describe any problems that might have occurred during the testing session. Some participants reported minor technical difficulties, related primarily to the stability of their Internet connection, but there were otherwise no systematic problems.


Results

The complete data file is provided in the Supplementary Materials. As in the reports from the comparison samples (Lima et al., 2020; Swaminathan et al., 2021), the statistical analyses incorporated standard frequentist null-hypothesis testing, as well as Bayesian analyses conducted with JASP version 0.14.1 (JASP Team, 2020) using default priors (see Footnote 1). Because of the large sample, very small effects were statistically significant with null-hypothesis testing. For example, with N = 608, correlations greater than .08 in absolute value were significant with p < .05. We considered small associations to be reliable only if they also passed a conventional threshold for what is considered substantial evidence using Bayesian statistics (Jarosz & Wiley, 2014; Jeffreys, 1961). Specifically, when the Bayes factor (BF10, reported here with three-digit accuracy) was greater than 3.00, the observed data were at least three times as likely under the alternative as the null hypothesis. Lower values (1.00 < BF10 < 3.00) indicated that the data provided evidence for the alternative hypothesis that was considered to be weak or anecdotal. If BF10 < 1.00, the observed data provided evidence that favored the null hypothesis in a reciprocal manner (i.e., substantial evidence when BF10 < .333). More extreme values provided strong (BF10 > 10.0 or < .100), very strong (BF10 > 30.0 or < .033), and decisive (BF10 > 100.0 or < .010) evidence for either the alternative or null hypothesis, respectively.
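The evidence categories and the significance threshold described above can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, the labels for BF10 < 1 are obtained by taking the reciprocal (so the cut-offs are 1/3, 1/10, and so on, rather than the rounded values in the text), and the critical correlation uses a normal approximation to the t distribution, which is very close for a sample this large.

```python
import math
from statistics import NormalDist


def evidence_label(bf10: float) -> str:
    """Map a Bayes factor (BF10) onto the verbal categories used in the text."""
    if bf10 <= 0:
        raise ValueError("Bayes factors must be positive")
    if bf10 == 1:
        return "no evidence either way"
    favored = "alternative" if bf10 > 1 else "null"
    strength = bf10 if bf10 > 1 else 1 / bf10  # reciprocal for BF10 < 1
    if strength > 100:
        label = "decisive"
    elif strength > 30:
        label = "very strong"
    elif strength > 10:
        label = "strong"
    elif strength > 3:
        label = "substantial"
    else:
        label = "weak or anecdotal"
    return f"{label} evidence for the {favored} hypothesis"


def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| significant at two-tailed alpha (normal approximation to t)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / math.sqrt(z * z + (n - 2))
```

With N = 608, `critical_r(608)` is just under .08, in line with the threshold stated above, which is why the Bayes factor was used as an additional filter for small associations.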

Initial analyses documented how the present online sample of participants differed from comparison samples in terms of gender, age, and music training. Detailed statistics are provided in the Supplementary Materials. The present sample had a larger proportion of participants who were men, and the mean age was higher than in Swaminathan et al. (2021) but similar to Lima et al. (2020). Mean levels of music training were higher in the present sample than in both comparison samples.

Swaminathan et al. (2021) did not report personality data, and their sample of undergraduates varied minimally in terms of education. Comparisons with the sample from Lima et al. (2020) revealed that the present sample had lower mean levels of education. For personality (Supplementary Table 3), the two samples differed for each trait, with the present sample scoring higher on openness to experience and neuroticism, but lower on agreeableness, extroversion, and conscientiousness.

The main analyses focused on musical ability, musical experience, and their correlates, including demographics (age, gender, education), cognitive ability, personality, and mind-wandering. Pairwise correlations among potential predictors are provided in Supplementary Table 4. We had no hypotheses about the testing language of the online study, and exploratory analyses confirmed that musical ability did not vary as a function of language when individual differences in age, education, cognitive ability, and openness to experience were held constant. In fact, for the Melody subtest, the Rhythm subtest, and Total scores of the MET, the observed data provided substantial evidence for the null hypothesis (all BF10 < .250). Testing language was not considered further.

Musical expertise

Because of the large number of musicians in the current sample, mean scores were higher than they were in Lima et al. across subtests and the General Factor, ps < .001, all BF10 > 100 (Supplementary Table 1). As in the comparison sample and elsewhere (Müllensiefen et al., 2014), pairwise correlations among Gold-MSI scores were all positive, and the observed data provided decisive evidence for an association in each instance (Supplementary Table 5). Examination of correlations between Gold-MSI scores and potential predictor variables revealed a relatively small number of instances in which the observed data provided substantial or stronger evidence for an association (Supplementary Table 6).

For demographic variables (age, gender, education), there was decisive evidence of a negative association between age and scores on the Emotions subtest. There was also strong evidence that men had more Music Training than women, and substantial evidence for a male advantage on the General Factor. Cognitive ability had no significant associations with Gold-MSI scores, and the observed data provided substantial (or strong) evidence for the null hypothesis for all subtests. As expected, there was strong evidence for a small, negative association between mind-wandering and the Music Training subtest, but mind-wandering was not associated with any other Gold-MSI score. For personality, openness to experience was associated decisively and positively with all Gold-MSI scores (rs ≥ .4). The observed data also provided decisive and substantial evidence for positive but small associations between extroversion and Singing Abilities, and between agreeableness and Music Training, respectively (rs ≤ .2).

Musical ability

Statistics from tests of internal reliability for the online MET are provided in Table 1. Cronbach’s alphas were virtually identical to those reported by the test’s developers (Wallentin et al., 2010b), and higher than those reported in the comparison sample (Swaminathan et al., 2021). Split-half (odd–even) reliabilities (Spearman-Brown formula) were also considerably higher than those reported by Swaminathan et al. In short, the internal reliability of the MET was not compromised by the online testing format.
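The split-half (odd–even) reliability with the Spearman-Brown correction can be sketched as follows. This is a generic illustration rather than the authors' code: it assumes a participants-by-trials table of 0/1 correctness scores, sums the odd-numbered and even-numbered trials separately, correlates the two half-scores, and applies the standard correction 2r / (1 + r). Function names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Step the half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(trial_scores: Sequence[Sequence[int]]) -> float:
    """Odd-even split-half reliability, Spearman-Brown corrected."""
    odd_totals = [sum(row[0::2]) for row in trial_scores]   # trials 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in trial_scores]  # trials 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))
```

The correction is needed because each half contains only half of the trials, and shorter tests are less reliable; a half-test correlation of .50, for instance, corresponds to a full-test reliability of about .67.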

Table 1 Reliability statistics, including Cronbach’s alpha and split-half (odd-even) correlations (Spearman-Brown formula), for scores on the MET. For comparison purposes, values from two previous reports are provided

Descriptive statistics for the Melody, Rhythm, and Total scores are provided in Table 2. For the entire sample, the observed means were higher than those reported by Swaminathan et al. (2021) for the Melody, Rhythm, and Total scores, as confirmed by independent-samples t tests, ts(1129) = 5.06, 5.90, and 6.23, respectively, ps < .001, all BF10 > 100. These findings were not meaningful, however, because of sample differences in musicianship. To rectify this problem, we gave separate consideration to individuals with no music training (see Table 2). For these participants, mean performance did not differ from that reported previously on the Melody subtest, p = .202, BF10 = .263, the Rhythm subtest, p = .053, BF10 = .725, or for Total scores, p = .064, BF10 = .625, although evidence favoring the null hypothesis was substantial only for the Melody subtest. In any event, online-generated scores were comparable to in-person scores when they were expected to be comparable.

Table 2 Descriptive statistics for scores on the MET. Melody and Rhythm scores were calculated from 52 trials. Total scores were calculated from 104 trials. For comparison purposes, values from Swaminathan et al. (2021) are provided

As one would expect, Melody and Rhythm scores were positively and decisively correlated, r = .551, N = 608, p < .001, BF10 > 100, with the magnitude of the association no different from that reported by Swaminathan et al. (2021), r = .489, p = .154, and Wallentin et al. (2010a), r = .520, p = .754 (see Footnote 2). As in the earlier reports, the data provided substantial evidence that performance did not differ between subtests, BF10 = .214.
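Comparisons of correlations from independent samples, such as the one above, are commonly made with Fisher's r-to-z transformation. The text does not name the test that was used, so the sketch below is an assumption about the method rather than a description of it. The function name is hypothetical; the sample sizes (608 and 523) come from the text.

```python
import math
from statistics import NormalDist
from typing import Tuple


def compare_independent_correlations(r1: float, n1: int,
                                     r2: float, n2: int) -> Tuple[float, float]:
    """Fisher r-to-z test for two correlations from independent samples.

    Returns the z statistic and its two-tailed p value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher transformation
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Present sample (r = .551, N = 608) vs. Swaminathan et al. (r = .489, N = 523).
z_stat, p_value = compare_independent_correlations(0.551, 608, 0.489, 523)
```

Under this assumption, the comparison with Swaminathan et al. yields a p value close to the reported .154; small discrepancies are expected because the correlations are rounded to three digits.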

Demographics, cognitive ability, mind-wandering, and personality

Correlations between MET scores and demographic variables, cognitive ability, mind-wandering, and personality are provided in Table 3. The observed data provided decisive evidence that as listeners increased in age, education, or cognitive ability, performance on the MET (i.e., Melody, Rhythm, and Total scores) tended to improve as well. The one exception was the association between cognitive ability and Melody scores, for which the data provided substantial rather than decisive evidence. The correlation with cognitive ability was also higher for the Rhythm than for the Melody subtest, z = 2.87, p = .004.

Table 3 Pairwise associations (Pearson correlations and Bayes factors) between scores on the MET and demographic variables, cognitive ability, mind-wandering, and personality
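
When two correlations come from the same participants and share a variable (e.g., cognitive ability with Melody and with Rhythm), the Fisher test for independent samples no longer applies. One common option is the z test of Meng, Rosenthal, and Rubin (1992). The sketch below assumes that test, which may differ from the one the authors used. The two correlations with cognitive ability are hypothetical placeholders (the actual values are in Table 3); only r = .551 between subtests and N = 608 come from the text.

```python
import math


def meng_z(r1, r2, r12, n):
    """z test for two dependent correlations that share one variable.

    r1, r2: correlations of the shared variable with variables 1 and 2
    r12:    correlation between variables 1 and 2
    n:      sample size
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    r_sq_mean = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - r_sq_mean)), 1.0)   # f is capped at 1
    h = (1 - f * r_sq_mean) / (1 - r_sq_mean)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2))              # two-tailed p value
    return z, p


# Hypothetical correlations of cognitive ability with Rhythm (.30) and Melody (.20).
z, p = meng_z(r1=0.30, r2=0.20, r12=0.551, n=608)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because the two subtests are themselves correlated, a modest difference between the correlations can be reliable in a sample of this size.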

For mind-wandering, there was substantial evidence for a negative association with scores on the Melody subtest, but no evidence of an association with Rhythm or Total scores. Nevertheless, the magnitude of the association was not significantly stronger for Melody than for Rhythm, p > .1. For personality, the observed data provided decisive evidence for positive associations between openness to experience and MET performance, but no evidence for associations with any other personality variable. In fact, all Bayes factors were below 1 with a single exception, and for two personality traits (conscientiousness, extroversion), the observed data provided substantial evidence for the null hypothesis.
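
The verbal labels attached to Bayes factors throughout this section (substantial, strong, very strong, decisive) can be made explicit with a small helper. The cutoffs below follow the conventional scheme attributed to Jeffreys (3, 10, 30, 100). This is an assumption, because the section does not restate its thresholds, although the labels used in the text are consistent with them. The function name and the "anecdotal" label are illustrative.

```python
# Conventional cutoffs (assumed, not restated in this section).
CUTOFFS = [
    (100, "decisive"),
    (30, "very strong"),
    (10, "strong"),
    (3, "substantial"),
]


def describe_bf10(bf10):
    """Return (strength label, favored hypothesis) for a Bayes factor BF10."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    favored = "H1" if bf10 >= 1 else "H0"
    strength = bf10 if bf10 >= 1 else 1 / bf10   # BF01 = 1 / BF10
    for cutoff, label in CUTOFFS:
        if strength >= cutoff:
            return label, favored
    return "anecdotal", favored


# Values reported in the text.
for value in (0.263, 0.725, 7.82, 14.7):
    print(value, describe_bf10(value))
```

For example, BF10 = .263 corresponds to BF01 of about 3.8, which is why the evidence for the null hypothesis was described as substantial for the Melody subtest but not for the Rhythm subtest (BF10 = .725).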

Musical expertise and music training

Our main tests of construct validity involved correlations between scores on the MET and those from the subtests and General Factor from the Gold-MSI, which are provided in Table 4. All correlations were positive and statistically significant, with p < .001, with the observed data providing decisive evidence for an association in each instance, except for the association between the Emotions subtest and Rhythm scores, which was strong but not decisive.

Table 4 Pairwise associations (Pearson correlations and Bayes factors) between scores on the MET and scores on the Gold-MSI (N = 608)

In the comparison sample (Swaminathan et al., 2021), music training proved to be a better predictor of Melody than of Rhythm scores. Our Gold-MSI scores showed a similar pattern. For Perceptual Abilities, Music Training, Singing Abilities, and the General Factor, correlations with the Melody subtest were higher than those for the Rhythm subtest, zs > 4, ps < .001. The same finding was weaker yet still evident for Active Engagement, z = 3.16, p = .002, but not for the Emotions subtest, p = .086.

Additional analyses focused solely on the Music Training subtest. Associations between Music Training and MET scores (see Table 4) were higher than those in the comparison sample (Swaminathan et al., 2021), which could be due to differences in how training was measured and/or a consequence of greater variability due to the higher proportion of musicians in the present sample. The correlations were somewhat lower than correlations between MET scores and current daily practice reported by Wallentin et al. (2010a, Experiment 3), a likely consequence of differences in measurement.

We also asked whether performance on the MET was associated with the age at which music training began. As in Swaminathan et al. (2021), we considered only participants who had any training (n = 415) and divided them into two groups: those who started by age 7—early starters (n = 120)—and those who started at an older age—late starters (n = 295). This split was theoretically motivated, based on the proposal of a sensitive period that extends up to 7 years of age, during which plasticity is greater and music training is presumed to have a stronger impact on development (Penhune, 2019, 2020; Penhune & De Villiers-Sidani, 2014).

The results were similar to those reported in the comparison sample (Swaminathan et al., 2021). Early starters had higher scores than late starters on the Melody subtest, t(413) = 3.18, p = .002, BF10 = 14.7, and on Total scores, t(413) = 2.96, p = .003, BF10 = 7.82, but not on the Rhythm subtest, p = .076, BF10 = .543. Nevertheless, early starters also had more Music Training, t(413) = 4.11, p < .001, BF10 > 100. When Music Training was held constant, the advantage for early starters disappeared for the Melody subtest, p = .078, BF10 = .577, and for Total scores, p = .083, BF10 = .527, although the observed data did not provide strong evidence for the null hypothesis.

Multiple regression analysis

In the final set of analyses, we used multiple regression to determine which correlates made independent contributions in predicting performance on the MET. Specifically, we modeled MET Melody, Rhythm, and Total scores from a linear combination of variables, each of which had a reliable simple association with MET scores: age, education, cognitive ability, mind-wandering, openness to experience, and the Gold-MSI subtests. The results are summarized in Table 5. For the Melody subtest, the Rhythm subtest, and Total scores, the overall model was significant, with independent and positive partial associations with age, education, cognitive ability, and the Perceptual Abilities and Music Training subtests from the Gold-MSI.

Table 5 Multiple regression results predicting MET scores from age, education, openness to experience, cognitive ability, mind-wandering, and the five Gold-MSI subtests

In the Bayesian counterpart to multiple regression, we first identified which model, out of all possible models, was most likely given the observed data. For the Melody subtest and for Total scores, it was a model that included age, education, cognitive ability, Perceptual Abilities, and Music Training, a finding that corroborated the frequentist results. We calculated a Bayes factor for each predictor by removing it from the model, one predictor at a time. As shown in Table 5, the observed data provided decisive evidence for the inclusion of Perceptual Abilities and Music Training in the model, and very strong (Melody) or decisive (Total) evidence for including cognitive ability and age. For education, however, the Bayes factor was less than 3. We calculated BF10 for the other (excluded) five variables by adding each to the model one at a time. For each variable, the observed data provided substantial evidence for the null hypothesis. In other words, the observed data were more likely with a model that did not include these variables.

For the Rhythm subtest, the best model of the data included age, cognitive ability, Perceptual Abilities, and Music Training. The observed data provided decisive evidence for the inclusion of age, cognitive ability, and Perceptual Abilities in the model, but only substantial evidence for including Music Training. For the other six variables, the observed data provided substantial evidence for the null hypothesis with one exception: they were more or less equally likely with a model that included or excluded education.
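
The logic of adding or removing one predictor at a time can be illustrated with the BIC approximation to the Bayes factor, BF10 ≈ exp[(BIC_reduced - BIC_full) / 2]. This is a swapped-in approximation for illustration only. The Bayes factors in Table 5 were presumably computed with default priors in the authors' software, so values from this sketch would not match them. The R-squared values below are hypothetical; only N = 608 comes from the text.

```python
import math


def bf10_from_r2(n, r2_full, r2_reduced, n_extra=1):
    """BIC-approximate Bayes factor for a full versus a reduced linear model.

    n:          sample size
    r2_full:    R-squared of the model that includes the predictor(s)
    r2_reduced: R-squared of the model without them
    n_extra:    number of predictors that differ between the models
    """
    # BIC_reduced - BIC_full; the total sum of squares cancels out.
    delta_bic = n * math.log((1 - r2_reduced) / (1 - r2_full)) - n_extra * math.log(n)
    return math.exp(delta_bic / 2)


# Hypothetical case 1: the predictor adds 2% explained variance.
print(bf10_from_r2(n=608, r2_full=0.32, r2_reduced=0.30))

# Hypothetical case 2: the predictor adds nothing, so the simpler model is favored.
print(bf10_from_r2(n=608, r2_full=0.30, r2_reduced=0.30))
```

The second case shows why a predictor with no incremental contribution yields evidence for the null hypothesis rather than an inconclusive result: the penalty for the extra parameter pushes BF10 well below 1.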

Discussion

We sought to determine whether an established and validated test of musical ability could be administered successfully online. Although approximately 20% of the sample who started the testing session did not complete it or provide usable data, this level of attrition is not surprising, because there was no compensation or incentive for participants to complete the session, other than to receive feedback about their personality, musical expertise, and musical ability. Moreover, the testing session was relatively long and, unlike in a laboratory, there were no research assistants to witness a participant’s decision to discontinue. In any event, the findings were otherwise unequivocally positive. Indeed, the results for the MET were both novel and noteworthy because it is an objective listening test of musical ability that, to our knowledge, has not been adapted previously for online testing.

The Gold-MSI served as our main variable for testing construct validity and as a proof of concept—that the present sample of online participants would respond similarly to a sample of participants tested in a more traditional format (Lima et al., 2020). Indeed, response patterns to the online Gold-MSI were very similar to those reported previously. For example, the internal reliability of the test was similar across formats except for the Emotions subtest. As in the earlier study, age correlated negatively with the Emotions subtest, although Lima et al. found a negative correlation between age and all Gold-MSI subtests. Discrepancies in response patterns between samples could stem from differences in music training. Compared to the previous study, we had a larger subsample of participants with very high levels of music education; one-quarter of our sample (25.6%) had 10 or more years of music lessons, whereas in Lima et al., the figure was closer to one-twentieth (5.6%). Because increases in musical experience must be accompanied by increases in age, a negative association between age and Gold-MSI scores would be less likely in our online sample. Despite these differences in samples, correlations among Gold-MSI subtests, and between Gold-MSI scores and personality variables, were similar across testing formats.

One null finding was that there was little evidence of an association between cognitive ability and the Music Training subtest from the Gold-MSI. In childhood, music training is often correlated positively with cognitive ability (Corrigall et al., 2013; Corrigall & Schellenberg, 2015; Kragness et al., 2021; Schellenberg, 2006, 2011; Schellenberg & Mankarious, 2012; Swaminathan & Schellenberg, 2020). In adulthood, however, such associations tend to be weaker (Lima & Castro, 2011; Schellenberg, 2006). When matrix-type tests of cognitive ability, such as Raven’s test and the test used in the present sample (MaRs-IB), are given to students from an introductory psychology course, positive associations with music training are evident in some instances (Swaminathan et al., 2017, 2018, 2021; Swaminathan & Schellenberg, 2018) but not in others (Schellenberg & Moreno, 2010; Swaminathan & Schellenberg, 2017). These associations may become less likely in samples of older participants with a large proportion of professional musicians (Lima & Castro, 2011).

Turning now to our main focus, the MET, the internal reliability of the online version proved to be similar to, perhaps even better than, in-person administration (Wallentin et al., 2010b; Swaminathan et al., 2021). Other results confirmed that (1) the correlation between Melody and Rhythm subtests did not differ across formats, (2) there was no difference in performance between subtests, and (3) when the present and comparison samples were equated for music training by focusing solely on participants with no training, average levels of performance were similar. Moreover, as in the comparison sample, there were no gender differences in performance on the MET. Finally, as in other samples, performance was strongly associated with openness to experience, but not with other dimensions of personality (Greenberg et al., 2015; McCrae & Greenberg, 2014; Swaminathan & Schellenberg, 2018; Thomas et al., 2016). In short, online testing did not compromise the reliability and validity of the MET.

Strong evidence of construct validity for our online version of the MET came from positive associations with scores on the Gold-MSI. Previous in-person studies documented that as the degree of musicianship and amount of practice (Wallentin et al., 2010a) or duration of music training (Swaminathan et al., 2021) increases, so does performance on the MET. In the present investigation, associations with Music Training as measured by the Gold-MSI were somewhat higher than those of the comparison sample (Swaminathan et al., 2021), which we attribute to the relatively high variability in music training and the high proportion of professional musicians tested online. We also found positive associations between MET scores and other aspects of self-reported musical expertise measured by the Gold-MSI, namely Active Engagement, Emotions, Perceptual Abilities, and Singing Abilities. In the Gold-MSI validation study, Müllensiefen et al. (2014) reported a comparable pattern of associations using short beat alignment and melodic memory tasks. Our results extended these associations, indicating that musical skills and experience are multifaceted, and not limited to music lessons or playing an instrument. Moreover, even though the musical skills tested by the MET are based on auditory short-term (working) memory and perceptual discrimination, performance was predictive of a broad range of musical behaviors and expertise.

As in the comparison sample, we found no association between musical abilities and age of onset of music lessons after duration of music training was held constant. This finding raises the possibility that proposals of plasticity effects arising from early music training (Penhune, 2019, 2020; Penhune & De Villiers-Sidani, 2014) may be exaggerated. Indeed, longitudinal evidence in childhood shows that musical ability is independent of music training when levels of musical ability measured 5 years previously are taken into account (Kragness et al., 2021). Nevertheless, other findings reveal behavioral advantages and structural brain differences as a consequence of early training, even after accounting for duration of training (Bailey et al., 2014; Bailey & Penhune, 2010, 2012, 2013). Perhaps early onset of music training explains some musical abilities, such as rhythm synchronization and production abilities, but not other abilities, such as those measured by the MET.

As noted, one advantage of online recruitment is that it allowed for a large sample of motivated individuals, including many who likely participated because they identified as working musicians or musician-academics. Our sample was also heterogeneous in terms of age and education, which tend to vary minimally when participants are recruited from undergraduate courses in introductory psychology, as in the MET comparison sample (Swaminathan et al., 2021). Substantial variance in education meant that we had two variables to represent cognitive ability: the objective test as well as self-reports of education. The status of age and its relation to cognition is more ambiguous, because some abilities, such as processing speed, start to decline relatively early in life, whereas others do not peak until after age 40 (Hartshorne & Germine, 2015). In any event, age, education, and our online measure of cognitive ability were predictive of performance on the MET. In the comparison sample, MET scores correlated positively with three different measures of cognitive ability: digit span forward, digit span backward, and Raven’s tests. Thus, as with virtually any specific cognitive ability, individual differences in musical ability vary positively with general ability (Carroll, 1993), whether they are measured in person or online.

Although the association between MET scores and cognitive abilities was consistent with previous research (e.g., Swaminathan et al., 2017, 2018, 2021; Swaminathan & Schellenberg, 2018), and strong even when other variables were held constant (Table 5), cognitive ability was a better predictor of scores on the Rhythm compared to the Melody subtest. Swaminathan et al. (2021, Table 8) also found evidence that general ability (i.e., working memory as measured by digit span backward) was a better predictor of Rhythm than of Melody scores. By contrast, music training was a better predictor of Melody compared to Rhythm in the online and in-person samples, and this difference extended to other aspects of musical expertise measured by the Gold-MSI, specifically Active Engagement, Perceptual Abilities, Singing Abilities, and the General Factor. In other words, performance on the Melody subtest appears to rely more on individual differences in exposure to music, whereas performance on the Rhythm subtest is more strongly associated with nonmusical individual differences. Swaminathan et al. (2021) suggested that this result might stem from the fact that the Rhythm subtest taps into a universal feature of music, whereas performance on the Melody subtest is more strongly influenced by exposure to pitch structures that are specific to Western music. Even in early childhood, 1 year of intensive music training improves melody discrimination more than it improves rhythm discrimination (Ilari et al., 2016).

Performance on the Melody subtest but not the Rhythm subtest was also linked to a lower level of mind-wandering, although this association disappeared when other predictors of Melody scores were held constant. In one previous study (Wang et al., 2015), highly trained musicians had an enhanced ability to sustain attention during a temporal discrimination task (but not in a visual discrimination task), and this advantage remained evident when cognitive ability was held constant. The association between musical ability and mind-wandering or sustained attention could be examined in more detail in future research.

Because the Gold-MSI subscales had considerable overlap (Supplementary Table 5), the multiple regression analyses served to identify which subscales made independent contributions to predicting performance on the MET. In addition to the Music Training subscale, the Perceptual Abilities subscale was a robust predictor of Melody, Rhythm, and Total scores, and, in the case of Rhythm, even superior to Music Training. This finding is indicative of participants’ meta-cognitive awareness of their musical ability: Individual differences in self-reports of music perception skills, measured before taking the MET, correlated with musical abilities measured subsequently and objectively.

The present study also had limitations. Although we asked participants to perform the experiment in a quiet environment and to avoid distractions, Internet testing made it difficult to control for extraneous sounds or potential interruptions, which remain a major challenge for online testing in general, and for auditory research in particular. Moreover, we did not include a task to ensure that participants used headphones (Milne et al., 2020; Woods et al., 2017). Although we strongly recommended that they use them throughout the experiment, it was not possible to verify whether they did.

In sum, the online version of the MET showed good internal reliability and appropriate levels of performance. Strong associations between accuracy on the MET and musical sophistication and training, especially for the Melody subtest, were also consistent with studies using in-person testing of the MET (Swaminathan et al., 2021). In addition, as expected, scores from online administration correlated with personality (openness to experience), cognitive ability, and mind-wandering. Online testing also had advantages compared to traditional in-lab testing, which have been noted by others (e.g., Casler et al., 2013; Gosling et al., 2004). For example, online recruitment allowed us to obtain a larger and more diverse sample compared to previous studies on musical abilities, including participants from different nationalities, a large number of professional musicians as well as nonmusicians, and participants who varied widely in age. Finally, the online format made it possible to recruit participants and collect data in a very short time (approximately 1 month), because we were not limited by the space and time constraints of the laboratory.

To conclude, our findings showed that online administration of the MET is a valid and reliable alternative to traditional in-person measurement of musical abilities. With greater worldwide access to the Internet, and in-person restrictions imposed by the COVID-19 pandemic, there has been a growing interest in the development of Internet methods. This study contributes to the growing literature on the utility of online testing as an alternative, or complement, to laboratory testing for psychological research.