An efficient and adaptive test of auditory mental imagery

Gelding, Rebecca W.; Harrison, Peter M. C.; Silas, Sebastian; Johnson, Blake W.; Thompson, William F.; Müllensiefen, Daniel

doi:10.1007/s00426-020-01322-3

Original Article
Open access
Published: 30 April 2020

Volume 85, pages 1201–1220 (2021)

Psychological Research

Rebecca W. Gelding ORCID: orcid.org/0000-0003-4883-8075¹,
Peter M. C. Harrison³,⁴,
Sebastian Silas⁴,
Blake W. Johnson¹,
William F. Thompson² &
…
Daniel Müllensiefen⁴

Abstract

The ability to silently hear music in the mind has been argued to be fundamental to musicality. Objective measurements of this subjective imagery experience are needed if this link between imagery ability and musicality is to be investigated. However, previous tests of musical imagery either rely on self-report, rely on melodic memory, or do not cater to a range of abilities. The Pitch Imagery Arrow Task (PIAT) was designed to address these shortcomings; however, it is impractically long. In this paper, we shorten the PIAT using adaptive testing and automatic item generation. We interrogate the cognitive processes underlying the PIAT through item response modelling. The result is an efficient online test of auditory mental imagery ability (adaptive Pitch Imagery Arrow Task: aPIAT) that takes 8 min to complete and adapts to each participant’s individual ability, and so can be used to test participants with a range of musical backgrounds. Performance on the aPIAT showed positive moderate-to-strong correlations with measures of non-musical and musical working memory, self-reported musical training, and general musical sophistication. Ability on the task was best predicted by the ability to maintain and manipulate tones in mental imagery, as well as to resist perceptual biases that can lead to incorrect responses. As such, the aPIAT is the ideal tool with which to investigate the relationship between pitch imagery ability and musicality.

Introduction

Historically mental imagery has been understood as the representation in the mind of a sensory experience in the absence of sensory input (Kosslyn, 1980). However, more recent theories of embodied cognition suggest that such representations are not limited to the mind only, but are distributed throughout or influenced by the body (Shapiro, 2011). Although ancient philosophers such as Aristotle believed that imagination was central to thought itself (MacKisack et al., 2016), it was not until the 1970s that modern research began to explore the phenomenon of visual imagery (Kosslyn, 1973; Shepard & Metzler, 1971). Visual images can be subjected to a number of operations such as inspection, zooming, rotation, and transformation (Thagard, 2005). However, only in the 1990s was the first volume written on the study of imagery in the auditory modality (Reisberg, 1992).

Musical imagery is often considered a subset of auditory imagery and has been described as the silent mental replaying of music in one’s own mind (Halpern, 2003). However, especially for musicians, musical imagery can involve more than just the auditory modality, with individuals developing multimodal representations of music notation and feeling the body movements implied by the music (Clark, Williamon, & Aksentijevic, 2012). The ability to hear music internally has been argued to be fundamental to musical expertise (Gordon, 1989b; Seashore, 1919), and hence, the earliest application of the study of musical imagery was limited to music education, teaching young musicians to imagine a desired sound and co-ordinate their movement to enable that sound to occur (Goldsworthy, 2010). More recent research has supported this association between imagery and musical skill, showing that musical imagery supports effective ensemble playing (Keller, 2012; Keller & Appel, 2010). Other research has explored the potential benefits of auditory imagery for movement disorders such as Parkinson’s disease and stroke (Lee, Seok, Kim, Park, & Kim, 2018; Schaefer, 2017), memory disorders such as dementia (Halpern, Golden, Magdalinou, Witoonpanich, & Warren, 2015), and the control of auditory hallucinations in clinical and non-clinical populations (Kumar et al., 2014; Linden et al., 2011; Shinosaki et al., 2003). Considering such wide-ranging implications of auditory imagery, efficient and reliable tests of auditory imagery ability are urgently needed.

Development of such tests may also have theoretical implications. Edwin Gordon defined “audiation” as “the hearing of music in one’s mind when the sound is not physically present” (Gordon, 1985, p. 34). The definition then is synonymous with “musical imagery” (Zatorre, Halpern, & Bouffard, 2010), yet to Gordon, audiation was a broader concept involving seven subtypes, that encompassed the processes involved in understanding music that has just been heard, recalling music, composing, as well as performing (Gordon, 1989b). Gordon’s fourth subtype of audiation, namely “recalling familiar music silently”, is, therefore, most relevant to the present study (Gordon, 1985). Gordon theorized that audiation is the central mental facility that represents musical aptitude, and hence designed tests to measure music audiation for all ages of development from pre-schoolers to adults (Gordon, 1989a). Today, these tests continue to be used by music researchers (Burgoyne, Harris, & Hambrick, 2019; Puschmann, 2013), though most recently some have argued that the norms for children and different age groups have not been updated for 3–4 decades and may no longer be valid (Ireland, Parker, Foster, & Penhune, 2018). However, Gordon’s (1985) audiation theory is often overlooked in the current musical imagery literature. The audiation tests that were developed consist of same-different melodic discrimination tests, which have been shown to involve a range of cognitive processes (Harrison, Musil, & Müllensiefen, 2016), and, therefore, are not specific enough to address individual differences. Hence, the development of a more efficient and specific test of auditory imagery may be used to address the theoretical question of whether audiation, specifically the subtype involving auditory imagery, is a main predictor of musical aptitude.

Numerous studies have examined musical imagery abilities, with many investigations focused on their neural correlates (Cebrian & Janata, 2010; Halpern, 1992; Herholz, Halpern, & Zatorre, 2012; Herholz, Lappe, Knief, & Pantev, 2008; Leaver, Van Lare, Zielinski, Halpern, & Rauschecker, 2009; Zatorre & Halpern, 2005; Zatorre et al., 2010; Zatorre, Halpern, Perry, Meyer, & Evans, 1996). However, most studies of musical imagery have explored passive musical imagery, using paradigms requiring continuation of familiar melodies in silence (Herholz et al., 2008; Weir, Williamson, & Müllensiefen, 2015), or comparisons of pitches from lyrics of familiar songs (Aleman, Nieuwenstein, Böcker, & de Haan, 2000; Halpern, 1992). Active musical imagery, which requires manipulation and control over the imagined content, has received less attention (Halpern, 2012; Zatorre et al., 2010). Across both forms, several limitations in the study of musical imagery remain. These include a lack of objective measures of performance (Kraemer, Macrae, Green, & Kelley, 2005) and inflexibility, with tasks that are too easy for musicians (Janata & Paroo, 2006) and too hard for non-musicians (Zatorre et al., 2010). Other tests have used musical notation to explore musical imagery in musical experts; however, these types of tests are not readily transferable to the general population (Wolf, Kopiez, & Platz, 2018). Given pitch and rhythm are the two primary dimensions of music (Krumhansl, 2000), and imagery performance in these domains has been found to be dissociable, with temporal accuracy often worse than pitch accuracy (Janata & Paroo, 2006; Weir et al., 2015), isolating these two dimensions should be useful for understanding individual differences in musical imagery. The Pitch Imagery Arrow Task (PIAT) was designed to address the former of these dimensions (Gelding, Thompson, & Johnson, 2015); through controlling for other musical features such as rhythm, timbre, and harmony, this task provides a measure of pitch imagery ability.

The PIAT has several advantages over existing protocols for evaluating imagery. Specifically, the task (1) requires a behavioural response to objectively measure accuracy and response times of imagery performance; (2) is extremely difficult to successfully perform using cognitive strategies other than pitch imagery; (3) employs novel rather than familiar sequences of pitches that cannot be anticipated in advance; (4) employs a range of difficulties implemented in a staircase design, such that it can test imagery in participants with a wide range of musical experience. However, one of the main limitations is the time taken to complete the task (approx. 1 h). With 90 trials, the task is time-consuming and experienced as tedious by many participants. Whilst some modified versions of the PIAT have been used (Colley, Keller, & Halpern, 2018; Greenspon & Pfordresher, 2019), they have also been non-adaptive to individual ability.

One way to optimize tests of individual differences, making them more time-efficient and reliable, is through modern psychometric techniques such as item response theory (IRT) and computerized adaptive testing (CAT) (Harrison, Collins, & Müllensiefen, 2017). The main prerequisite for a PIAT version using IRT and CAT is a psychometric model that predicts the difficulty of PIAT items. The aim of the present studies was to construct and validate such a model. First, an exploratory study using the original PIAT tested 115 participants to determine the key variables that contribute to item difficulty. A cognitive model of the processes used to complete a PIAT trial was then developed on the basis of these exploratory results. Subsequently, a calibration study was conducted that systematically tested a large bank of pre-generated items and determined parameters of an explanatory IRT model. This final model serves to construct a CAT version of the PIAT, the new adaptive PIAT (aPIAT) which is both shorter and more efficient. Several studies have shown a link between working memory ability and imagery vividness (Baddeley & Andrade, 2000; Cebrian & Janata, 2010), and an overlap in brain regions responsible for short-term/working memory processes and effortful auditory imagery processes (For review, see Schaefer, 2017). Given that manipulation of auditory images relies heavily on working memory representations (Keller, 2012), and the aPIAT involves manipulation of pitch images, in Study 3, the test–retest reliability and validity of the aPIAT are assessed against a range of musical and non-musical working memory tasks.

Study 1: exploratory phase

The aim of the first study was to identify features of musical structure and aspects of trial design that contribute to item difficulty on the original PIAT and, hence, to generate an initial psychometric model of task performance on the PIAT.

Materials and methods

Participants

A total of 115 participants completed this study over three recruitment stages. An initial 40 participants (22 females) were recruited for the original PIAT study (Gelding et al., 2015). An additional 24 participants (15 females) completed the same task, as outlined in Gelding et al. (2015), to qualify for a different study. All of these participants (n = 64) completed the original version of the PIAT along with two control conditions: perception and mental arithmetic. Perception trials were identical to Imagery trials (described below), but with no arrows presented in silence. Hence, participants matched the audible probe to the last note just heard. Mental arithmetic trials required simple addition and subtraction of ongoing sums as guided by visual presentation of up/down arrows and digits. The remaining 51 participants (35 females) completed the PIAT with only imagery trials included (that is, no mental arithmetic or perception control conditions). This latter group also completed a rhythm imagery task during the experimental session either before or after the PIAT.

Materials

Pitch Imagery Arrow Task (PIAT)

An individual trial on the PIAT begins with an ascending major scale to provide a tonal context. A start note (either tonic or dominant of the scale) is then presented simultaneously with the visual presentation of a dot on the screen. A variable number of up/down arrows are next displayed in random order, with each arrow accompanied by a corresponding pitch that moves up/down the scale in stepwise motion. Pitch changes always match the direction indicated by the arrows. These stimuli are followed by a continuation phase consisting of a number of silent arrows, in which participants are required to imagine the corresponding stepwise changes in pitch. Immediately after the sequence of silent arrows, a pre-probe screen appears, to give participants time to consolidate their current pitch image and prepare to hear the probe. One second later, an audible probe pitch is sounded. Participants are then required to indicate whether the probe matches the final imagined tone. When the probe is incorrect, it is always within the same key signature, so that it is not obviously wrong, and a maximum of 2 steps away from the correct answer. A staircase design was used in which all participants began on the easiest difficulty and progressed to increased complexity with accurate responses (2 correct answers or 90% correct on a given stage of the task). See Gelding et al. (2015) for more details of the staircase design.
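The trial structure described above can be illustrated with a minimal sketch. It is not the original Presentation® implementation: the function name `make_trial`, the representation of pitches as scale-step integers (tonic = 0, dominant = 4), and the assumption that up and down arrows are equally likely are all illustrative choices.

```python
import random


def make_trial(n_heard, n_silent, probe_correct, rng):
    """Build one illustrative PIAT-style trial in scale-step units (hypothetical helper)."""
    # Start on the tonic (step 0) or the dominant (step 4) of the major scale.
    start = rng.choice([0, 4])
    # Each arrow moves one scale step up (+1) or down (-1); equal odds are an assumption.
    heard = [rng.choice([-1, 1]) for _ in range(n_heard)]
    silent = [rng.choice([-1, 1]) for _ in range(n_silent)]
    last_heard = start + sum(heard)
    # The target is the note the participant should be imagining after the silent arrows.
    target = last_heard + sum(silent)
    if probe_correct:
        probe = target
    else:
        # Incorrect probes stay in key and lie at most 2 scale steps from the target.
        probe = target + rng.choice([-2, -1, 1, 2])
    return {
        "start": start,
        "heard": heard,
        "silent": silent,
        "last_heard": last_heard,
        "target": target,
        "probe": probe,
    }


rng = random.Random(0)
trial = make_trial(n_heard=3, n_silent=2, probe_correct=False, rng=rng)
print(trial)
```

Working in scale steps rather than frequencies keeps the sketch independent of key and of the stimulus files used in the actual task.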

Psychometric questionnaires

As well as completing the PIAT, participants also completed two questionnaires, one to measure musical background and the other to measure auditory imagery vividness and control. First, participants in the first two recruitment stages (n = 64) completed a generic musical background survey, from which their years of active musical engagement were calculated. This was then used to calculate a Musical Experience Index (MEI) based on the percentage of life years spent actively engaged in music (i.e., years of musical engagement/age). Participants from the third recruitment stage (n = 51) completed only the Goldsmiths Musical Sophistication Index (Gold-MSI; Müllensiefen, Gingras, Musil, & Stewart, 2014) to obtain a comprehensive profile of their musical skills and experiences. The musical training subscale of the Gold-MSI is of particular importance for the current study given the posited link between the ability to imagine music and the amount of formal musical training received (Aleman et al., 2000). Participants in this third recruitment cohort showed a good spread of musical training background with scale scores ranging from 10 to 44 (mean = 26.5, median = 27, SD = 10.46), which is similar to the distribution of musical training in the general population (median = 27 in Müllensiefen et al., 2014). To equate the two different measures of musical training, an MEI was calculated for the third recruitment cohort by taking their response to the question of years of musical training and dividing by their age. However, the Gold-MSI requires participants to tick a box for the years of musical training, and the category for the longest period of musical training is “10 + years”. Given the minimum age of participants was 18 years, this means that the maximum MEI approximated for the third recruitment cohort was 10/18 = 0.55. This was the case for 12 out of the 51 participants.
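The MEI calculation, including the cap imposed by the top Gold-MSI response category, can be written out as a short sketch. The function name and the example participant are hypothetical; only the formula (years of musical engagement divided by age) and the 10-year cap come from the text. Note that 10/18 is approximately 0.556, which the text reports as 0.55.

```python
def musical_experience_index(years_engaged, age):
    """Proportion of life years spent actively engaged in music (MEI)."""
    if age <= 0:
        raise ValueError("age must be positive")
    return years_engaged / age


# Hypothetical 18-year-old with 15 years of training: the top Gold-MSI category
# is "10 + years", so the recorded value is capped at 10 before dividing by age.
GOLD_MSI_MAX_YEARS = 10
capped = musical_experience_index(min(15, GOLD_MSI_MAX_YEARS), 18)
print(f"{capped:.3f}")
```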

Second, all participants completed the Bucknell Auditory Imagery Scale (BAIS; Halpern, 2015). This 7-point Likert scale includes two subscales, for vividness (BAIS-V) and control (BAIS-C), each of which has 14 items. Participants in this study showed a good range of vividness scores from 2.85 to 7 (mean = 5.025, median = 4.929, SD = 0.960) and a range of control scores from 3 to 7 (mean = 5.202, median = 5.286, SD = 0.964), which is similar to the distribution reported by Halpern (2015), who found that both BAIS-V and BAIS-C had mean scores of 5.1 and SDs of 0.9.

Procedure

Presentation® software (Version 18.0, Neurobehavioral Systems, Inc., Berkeley, CA) was used to control the experiment and to record responses. Acoustic stimuli were generated from the 'Piano' instrument sound by Finale 2012 software (Makemusic Inc; Eden Prairie, MN) and exported as .wav files for use in Presentation®.

Upon being seated in front of the computer with headphones, participants were given a sound check, whereby they could manually adjust the volume of the tones to a suitable level. They were then introduced to the task. Participants were informed that they were not allowed to move or hum to assist them with the task, but that they should “as vividly as possible, imagine the tones and keep their bodies still”. An opportunity for questions was given prior to the start of the task.

The task had a fast exit whereby participants who failed to progress through Level 1 of the Imagery Trials on more than 3 attempts (that is, gave more than 18 incorrect responses on Level 1 Imagery Trials) were excused from further trials. Fourteen participants were triaged in this way, having completed between 41 and 77 trials at their point of exit. These participants were deemed to have found the task too difficult or to have failed to understand how to complete it. At each point of failing Level 1, the participants were given the opportunity to ask questions and the requirements of the task were reiterated verbally.

Upon completion, participants were asked verbally to rate how vividly or clearly they formed the musical images during the task (1—not at all vivid; 5—very vivid). They were also asked: “What strategies did you use to complete the musical imagery task?” Verbal responses were recorded by the experimenter. Participants then completed the BAIS and musical experience or Gold-MSI questionnaires (as per Materials section).

Ethics

All participants provided written consent and all procedures were approved by the Macquarie University Human Research Ethics Committee.

Results

In a first step, correct responses of each participant were summed to characterize each individual’s performance on the PIAT. Summed scores ranged from 41.5 to 99% correct responses with a mean of 75.2% (SD = 11.7%) and a median of 75.9% (first quartile at 70% and third quartile at 82.2%). Table 1 shows the correlations between PIAT scores and demographic as well as musical background variables. There were no significant correlations between performance on the PIAT and gender or age (p values ≥ 0.62). In contrast, PIAT scores correlated substantially and significantly [all p values < 0.005 after correcting for multiple comparisons using Holm’s (1979) procedure] with all indicators of musical background.

Table 1 Correlations with performance accuracy

In particular, the correlation with the aggregated number of years of active musical training/engagement (MEI) of r = 0.53 (p < 0.001) and the correlation with the Musical Training subscale of the Gold-MSI of r = 0.50 (p < 0.01) reflect the predicted association between musical training and musical imagery ability (Aleman et al., 2000).

In a second step, data at the level of individual trials were analysed using the packages lme4 (De Boeck et al., 2011), AICcmodavg (Mazerolle, 2017), and psyphy (Knoblauch, 2014) in the statistical computing environment R (R Core Team, 2014). These models took the form of mixed-effects logistic regressions, where the outcome variable was trial success (0 or 1). Categorical variables were dummy-coded. We used a model selection strategy based on minimising the corrected Akaike Information Criterion (AICc) as described in Long (2012); the resulting model parameters are listed in Table 2. (See Appendix 1 for the full description of all parameters used. Parameters were identified retrospectively as features of the task that could be manipulated to impact item difficulty).
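As a reminder of the selection criterion, the sketch below computes the small-sample corrected AIC in its standard form, AICc = −2 log L + 2k + 2k(k + 1)/(n − k − 1), and picks the candidate with the lowest value. The log-likelihoods, parameter counts, and sample size are hypothetical and are not taken from the study; how k and n should be counted for mixed-effects models is a further assumption that this sketch does not address.

```python
def aicc(log_likelihood, k, n):
    """Corrected Akaike Information Criterion for k parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


# Hypothetical candidates: name -> (maximised log-likelihood, number of parameters).
candidates = {"model_a": (-520.0, 4), "model_b": (-515.0, 9)}
n_obs = 1000  # hypothetical number of trials

scores = {name: aicc(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # lower AICc is preferred
print(best, scores)
```

In this made-up comparison the two models have the same AIC, so the correction term alone favours the model with fewer parameters.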

Table 2 Generalized mixed-effects regression model for performance accuracy with 95% confidence intervals

The best model (see Table 2) included random effects for participants and items, as well as 6 fixed effects for (1) Level (i.e., the number of silent arrows), (2) the probability of the probe, given the total number of arrows presented in the trial, (3) a binary variable indicating whether the probe note was identical to the start note of the audio–visual sequence, and 3 factors for the different Stages of the trial, that represent variability in start notes and number of heard tones/arrows in the set-up component of a trial (for more detail on the Level/Stage structure of the staircase design, see Gelding et al., 2015). The lower asymptote (guessing level) and the upper asymptote (ceiling level) of the model were optimized given these fixed and random effects and optimal values were identified at 0.3 (floor) and 0.95 (ceiling). Using tenfold cross-validation, the classification accuracy of the final model was 64.9% without random effects (i.e., not using model-based ability estimated from the same participants) and 71.6% with random effects (i.e., using model-based ability estimated from the same participants).
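The role of the two asymptotes can be made concrete with a small sketch. The text does not give the link function explicitly, so the form below is an assumption: a logistic curve rescaled to run between the lower asymptote (0.3) and the upper asymptote (0.95), which is the usual way such bounds are imposed. The name `p_correct` and the argument `eta` (the linear predictor combining fixed and random effects) are illustrative.

```python
import math


def p_correct(eta, lower=0.3, upper=0.95):
    """Map a linear predictor to P(correct) bounded by two asymptotes (assumed form)."""
    logistic = 1.0 / (1.0 + math.exp(-eta))
    # Rescale the 0-1 logistic curve so that it runs from `lower` to `upper`.
    return lower + (upper - lower) * logistic


for eta in (-4.0, 0.0, 4.0):
    print(eta, round(p_correct(eta), 3))
```

Under this form, even a participant of very low ability on a very hard item is predicted to succeed about 30% of the time, and even the strongest participant on the easiest item is predicted to make occasional lapses.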

Discussion

The results of the exploratory study show that there are considerable individual differences between participants on the PIAT and that task performance is significantly correlated with musical training and self-reported ability to imagine auditory material. In addition, data modelling at the individual trial level showed that meaningful factors that affect task difficulty can be identified. Results of the model evaluation demonstrate that these factors (i.e., fixed effects) explain a sizeable proportion of model accuracy (64.9%). Including personal information (i.e., random effects of participant ability) further increases model accuracy to 71.6%. The sizable contributions of individual differences on the task suggest that it is especially suitable for computerized adaptive testing.

The largest predictor of item difficulty was the number of tones that the participant had to imagine: more tones led to higher difficulty. The second largest predictor was the proportion of other items in the item bank that shared the same probe tone (Probability_Probe): less frequent probe tones led to higher difficulty. The probe note was calculated in terms of steps away from the start note, and given the various possible arrow combinations, there was higher probability of the probe note being closer to the start note than at the extremes of the tonal pattern. Repeated exposure to the tones surrounding the start note may have made more frequent probe tones easier or may have biased the participant to expect more frequent probe tones. In addition, we found fewer correct responses for trials where the probe tone was identical to the first tone of the sequence, which suggests a perceptual bias when the start note is used as the probe. That is, for incorrect probes when the probe was the start note, participants were more likely to select it as correct and, therefore, make an error. This confound of task difficulty can be removed by ensuring that trials do not have the probe as the start note. Finally, simpler trial stages (fixed start note and less variability in number of heard tones/silent arrows in set-up) proved to be easier for participants.
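Why end points near the start note are more probable than those at the extremes can be seen by enumerating arrow sequences. The sketch below assumes that every arrow is independently up or down with equal probability; this is an illustration of the combinatorics only and is not the authors' computation of Probability_Probe, which was based on the items actually in the item bank.

```python
from collections import Counter
from itertools import product


def endpoint_distribution(n_arrows):
    """Probability of each net displacement (in scale steps) from the start note."""
    # Enumerate every up (+1) / down (-1) sequence and tally where it ends.
    counts = Counter(sum(seq) for seq in product((-1, 1), repeat=n_arrows))
    total = 2 ** n_arrows
    return {steps: counts[steps] / total for steps in sorted(counts)}


dist = endpoint_distribution(4)
print(dist)
```

With four arrows, ending back on the start note is six times as likely as ending four steps above it, which matches the qualitative pattern described in the text.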

Taken together, the results of the exploratory study suggest that the PIAT is well suited to serve as the basis for an effective test of pitch imagery ability built on a rigorous item response model. Results of the exploratory study also help to construct a hypothetical cognitive model of task performance on the PIAT, which serves as the basis for the subsequent calibration study.

Cognitive model

To simplify a PIAT trial, improvements were made to the probe and response components of the trial. The original PIAT involved a pre-probe screen to alert participants to the need to maintain the current image and to prepare them to hear the probe, which occurred 1 s later (Gelding et al., 2015). In the updated PIAT trial, the pre-probe screen was removed, and instead, the final silent arrow bore the word “hold” and was displayed for 2 s instead of 1 s. A white cross appeared on the screen when the probe was sounded (see Fig. 1). The participants then answered the question “Did the final tone match the note you were imagining?”, with two buttons at the bottom of the screen (“Match” or “No Match”) to choose from.

Using the participants’ descriptions of the strategies they used to do the task, as well as common-sense reasoning about the thought process involved in completing it, a cognitive process model was developed. The purpose of the cognitive process model was to describe the stages of processing of a PIAT trial, to consider how different variables may be related to item difficulty, and thereby to inform the future calibration modelling (Harrison et al., 2016). The cognitive process model included the following stages: perceptual set-up, auditory imagery generation, manipulation and maintenance, similarity comparison, and decision-making (see Fig. 2).

Perceptual set-up occurs as the participant activates the tonality template for the trials from the presentation of the initial scale and start note. Next, coordinated audio–visual processing is activated through the arrows and tones being presented together during the set-up component. Generation of the first auditory image occurs when the first silent arrow is presented. Given the uncertainty of when the first silent arrow will occur, expectation for a silent arrow increases once the initial number of heard arrows reaches 3, given that all trials had at least 3 sounded arrows in the set-up component. Subsequent processing of the silent arrows guides the manipulation of the auditory image. When the arrow with “Hold” appears, participants then maintain the last imagined note in working memory. A similarity comparison is made when the probe is heard, with a participant then making the decision whether the probe matches the last note which they were imagining.

Item features that impair the imagery stages of the PIAT cognitive process model should increase item difficulty. For example, if the correct auditory image is not originally generated, then subsequent manipulations would lead to an incorrect response. Hence, if participants fail to complete Level 1 items correctly, this suggests a lack of ability in generating a correct auditory image. Errors can also occur during manipulation, if participants are not paying full attention to the silent arrows (and lose one or more steps), if manipulations are performed incorrectly with more than a single step taken with each arrow, or if their imagery strength diminishes over the trial, leading to an impoverished or incorrect image being maintained during the pre-probe period. These types of errors are more likely at higher levels. In such cases, participants may use the information still available to them to complete the task, some of which may cause biases in responses. For example, memory for important notes from the heard sequence (e.g., the tonic or fifth of the scale presented or indeed any note contained in sequence) may bias participants to respond as “match” if imagery for the last note is not strong enough to compare to the probe, and the probe instead matches an important note from the sequence (Deutsch, 1970, 1972). This bias would increase accuracy for correct probe trials but results in errors for incorrect probe trials. Having several steps in one direction within a trial may also increase item difficulty as the correct probe would be further away from the last note heard (hence, items with a larger distance between last heard note and probe would be more difficult).

Other information available to participants if they lose their imagery may be the approximate direction of the probe relative to the last note heard, which could be tracked through counting arrows. If the direction of the probe relative to the last note heard is consistent with the direction of the arrow count (i.e., if the probe is above the last heard note, and the arrow count is positive), then incorrect probe trials will be more difficult to detect, leading to increased errors. Conversely, if the direction of the probe relative to the last note heard is inconsistent with the direction of the arrow count (i.e., if the probe is above the last heard note, and the arrow count is negative), then incorrect probe trials would be much easier to detect. Incorrect probe trials should also be more difficult if the probe is 1 step rather than 2 steps away from the true imagined note, as the further away the probe is from the true imagined note, the more obviously wrong it will be. Finally, in lieu of accurate imagery representations, participants may also use implicit approximations of probe probability to decide on the likelihood of a given probe being correct, based on the last note heard, the start note, or the total number of arrows in the trial.
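The item features discussed in this paragraph can be derived mechanically from a trial. The sketch below is illustrative only: the function and key names are hypothetical, pitches are expressed in scale steps, and the "arrow count" is assumed to be the net number of up minus down silent arrows, which the text does not state explicitly.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def probe_features(silent_arrows, last_heard, probe):
    """Derive illustrative difficulty features for one trial (hypothetical names)."""
    arrow_count = sum(silent_arrows)  # assumed: net up (+1) minus down (-1) arrows
    target = last_heard + arrow_count  # the true imagined note
    probe_direction = sign(probe - last_heard)
    return {
        "probe_correct": probe == target,
        "steps_from_target": abs(probe - target),
        # True when the probe lies on the side of the last heard note
        # that the arrow count points towards.
        "direction_consistent": probe_direction != 0
        and probe_direction == sign(arrow_count),
    }


features = probe_features(silent_arrows=[1, 1, -1], last_heard=2, probe=4)
print(features)
```

In this example the probe is incorrect but lies one step from the target and in the direction the arrows point, the combination the cognitive model predicts to be hardest to reject.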

Once the probe is sounded, participants compare their imagined note with the probe and must decide whether it is correct. If the imagined note matches the sounded probe, then a correct decision is straightforward. If it does not match, participants consider their confidence in their imagined note, and the other information at hand, to determine whether to select “no match” or whether they have made an error in their imagery and should instead respond as “match”. Confidence in a response should be highest when the true imagined note matches the last note heard, or when the true imagined note is the tonic or dominant. Hence, this cognitive model suggests that any explanatory model of data collected from the PIAT should consider correct probe trials and incorrect probe trials separately, and that there are many variables that can be extracted from a trial that could potentially predict item difficulty. These variables are listed and described in Appendix 1 and comprise any quantifiable features of an item/trial that may contribute to item difficulty. Whilst confidence was not measured as part of the PIAT response, future studies could explore continuous confidence ratings along with binary “match” and “no match” responses.

Study 2: calibration phase

As a result of the exploratory phase and the development of the cognitive model, several changes were made to the PIAT and a calibration study was conducted. The aim of the calibration study was to explore how item difficulty relates to the different features of a new set of experimental stimuli (N = 3000 items). In this new set, the stimuli systematically vary on predictors identified as important in the exploratory phase. The output of the calibration phase is an improved explanatory model that can form the basis for the adaptive version of the PIAT (aPIAT).