Learning to speak a foreign language, particularly as an adult, can be challenging and time-consuming. Some modern language teachers use songs in the classroom or employ musical and rhythmical mnemonic devices as ways to reinforce the learning of foreign language material (Anton, 1990; Felix, 1989; Murphey, 1992; Spicher & Sweeney, 2007). It has been argued that the comprehension of nonnative speakers’ speech is dependent on “quasi-musical,” prosodic features such as rhythm, stress, and intonation (Parker, 2000), and that when learning a new language, mastery of such prosodic features (as opposed to individual sounds or syllables) is especially important in order to be understood by native speakers (White & Mattys, 2007). Musical ability has also been linked with foreign language ability, such as in correlations found between Japanese adults’ musical ability and their second language (L2) speaking ability and pronunciation skills in English (Slevc & Miyake, 2006), and between participants’ length of musical training and their ability to imitate foreign language phrases (Pastuszek-Lipińska, 2008). Pitch perception has also been associated with pronunciation abilities in a second language (Posedel, Emery, Souza, & Fountain, 2012).

Classroom-based studies with children have also reported benefits for foreign language vocabulary learning when the material is presented with a melody (Medina, 1993; Murphey, 1990). An adaptation of an experimental statistical-learning paradigm developed by Saffran and colleagues (Saffran, Aslin, & Newport, 1996; Saffran, Johnson, Aslin, & Newport, 1999) and conducted with adults showed that participants (native French speakers) who heard a sound stream of sung syllables were able to identify word boundaries after only a 7-min training period, whereas participants in a monotone speech condition performed at chance levels (Schön et al., 2008). The authors interpreted these results as suggesting that, since pairing each syllable with a consistent pitch can lead to quicker word segmentation of a sound stream, songs might be particularly helpful during the beginning stages of L2 learning. However, on the whole, very little research evidence has supported the range of educational claims that have been made regarding the benefits of singing and music in foreign language learning (Sposet, 2008).

In contrast, a range of research has shown links between music and native language abilities, including a long-term memory benefit for learning verbal material through listening to songs (Calvert & Tart, 1993) and better memory and quicker relearning of a list of proper names that was initially learned through song, rather than through hearing speech (Rainey & Larsen, 2002). Classroom-based research has also highlighted the potential benefits of music for native language phonological awareness and literacy skills (Gfeller, 1983; Martin, 1983; Overy, 2003). Of course, native and foreign language learning are quite different in their processing demands, and thus musical or sung presentations may have differential effects when used in first language (L1) versus L2 learning. For example, learning song lyrics in a native language will include the automaticity of meaning processing, whereas learning song lyrics in a new language may require more processing effort and involve smaller units of “chunking” (syllables and words rather than entire phrases), particularly at the beginning stages of L2 learning. Nevertheless, some parallels between L1 and L2 processing may be affected similarly by a musical or sung presentation, such as the auditory memory and sequencing requirements of verbatim memory for words or phrases (Appel & Lantolf, 1994; Bellezza, 1981; Rainey & Larsen, 2002). We thus turn to a brief discussion of the insights that experimental research into the effects of musical presentations during native language learning can offer an experimental study of the role of singing in foreign language learning.

A series of four experiments conducted by Wallace (1994) was among the first to provide evidence in support of the idea that an appropriate musical presentation has the potential to support verbal memory on a language task. Over five listening sessions, one after the other, participants heard three verses of previously unfamiliar folk ballads, after which memory for the words was immediately tested using a written free recall task. Across a range of learning conditions, the results showed that pairing each verse with the same melody during the learning process led to the highest verbal memory performance on several different analyses of verbatim written text recall. By contrast, pairing a different melody with each individual verse of the ballad was actually less effective than hearing a spoken version. Wallace thus hypothesized that when the three verses had three different melodies, the frequently changing music served as a distraction, rather than helping participants memorize the lyrics. Wallace concluded that using songs with a repeated, simple pattern can facilitate verbatim text recall in the native language (Wallace, 1994).

One question arising from this study is the extent to which the lyrics and melodies of songs might become integrated in memory. Morrongiello and Roes (1990) found evidence to suggest that, although lyrics seem to be more salient than melodies during song listening, lyrics are recognized more easily later when they are paired with the same melody during both the encoding and recognition stages. Similarly, Thiessen and Saffran (2009) used a head-turn paradigm with infants 6.5 to 8 months of age and found significantly more recognition of both lyrics and melodies when they had previously been presented together in song (rather than presented separately). The authors interpreted this finding as being due to the multiple, related regularities of consistent pitches and words being presented together during the sung stimuli.

More recently, evidence from cognitive neuroscience has provided support for the idea that music and language are linked at the neural processing level (Besson, Schön, Moreno, Santos, & Magne, 2007; Gordon, Schön, Magne, Astésano, & Besson, 2010; Jentschke, Koelsch, & Friederici, 2005; Milovanov, Huotilainen, Välimäki, Esquef, & Tervaniemi, 2008; Patel, 2008). Several proposals have been put forward by neuroscientists in support of the idea that a musical presentation of linguistic stimuli may help particularly at the encoding stages of memory, and particularly for verbatim language tasks (Sammler et al., 2010; Thaut, Peterson, & McIntosh, 2005).

Behavioral studies have also shown verbatim recall and recognition to be facilitated by a sung presentation, as compared with a spoken version of the same materials, and especially for difficult tasks. One study showed an advantage for a sung (vs. spoken) advertising slogan paired with a product, but only on a more difficult recall task; no effect was found on an easier recognition task (Yalch, 1991). Another study revealed that both rhythm and melody were effective facilitators of verbal recall for folk song lyrics, as compared with a spoken version (Purnell-Webb & Speelman, 2008). The results from experiments comparing “verbatim” recognition for short passages of instrumental music (without words) and for poetry have also shown that verbatim recognition is better for both instrumental music (Dowling, Tillmann, & Ayers, 2002) and poetry (Tillmann & Dowling, 2007) than for prose.

By contrast, a few studies using similar experimental designs have produced contradictory evidence. After first replicating the findings of Wallace (1994), Kilgour, Jakobson, and Cuddy (2000) controlled for the presentation rate and total duration of the auditory stimuli, with results showing no memory advantage for the sung a cappella presentation of the song lyrics relative to a spoken version. The researchers did find significantly higher performance for musicians than for nonmusicians after participants had heard the stimuli twice, but not after the first listening session. Thus, controlling for the rate of presentation and duration of stimuli in the different listening conditions is clearly an important feature when testing whether music may have facilitative effects for verbal memory, particularly since a slower rate of presentation in musical conditions has previously been considered to account for any learning or memory facilitation (cf. stuttering research; e.g., Healey, Mallard, & Adams, 1976).

A key study conducted by Racette and Peretz (2007) investigated whether the facilitative effect of melody for verbal learning, which has often been assessed using written, verbatim text recall, would apply for oral recall. Native French-speaking participants listened to lines from unfamiliar French folk songs that were either sung a cappella or spoken. The results did not show a facilitative effect for the sung presentation, and surprisingly, the opposite result was found, with the spoken presentation producing better recall for the words in both the short and the long (i.e., several months) terms. The authors concluded that singing does not have a facilitative effect on verbal memory and suggested that memorizing the words of a song prior to learning the melody is likely to result in better retention. However, one possible explanation for this unexpected result is that the folk songs used by Racette and Peretz had complex, nonrepetitive melodic lines rather than the type of simple, easy-to-learn melody that Wallace (1994) concluded was facilitative for verbal memory. In addition, the stimuli were not controlled for presentation rate.

In summary, the evidence from most of the studies that have shown that music can facilitate verbal learning and memory suggests that the benefit will be greatest for verbatim recall tasks (Dowling et al., 2002; Thaut, Peterson, Sena, & McIntosh, 2008; Tillmann & Dowling, 2007; Wallace, 1994; Yalch, 1991). The benefits of a sung presentation may disappear when the rate of stimulus presentation is carefully controlled (Kilgour et al., 2000), especially if the test does not require verbatim recall (Yalch, 1991). A sung presentation may even be detrimental for verbal learning and memory when the song’s melodic and rhythmic structures are difficult to learn (Racette & Peretz, 2007; Wallace, 1994). The rate of stimulus presentation, the overall duration of stimuli, and the song’s complexity are thus important considerations that may influence verbal learning and memory through song. To date, there is no consensus regarding whether learning verbal material with a melody can provide benefits for learning and memory, whether in the native language or in a foreign language.

To our knowledge, no study has been conducted to investigate whether adults’ aural/oral learning in an unfamiliar language can be facilitated by singing during the learning process. For the present study, three different methods of presentation were developed in order to compare the relative effects of presenting and rehearsing the material in different ways: singing, speaking, and rhythmic speaking. Our participants were taught 20 phrases in an unfamiliar language, Hungarian, during a 15-min aural/oral “listen-and-repeat” learning procedure. On the basis of previous research findings for verbal material in the native language (Wallace, 1994; Yalch, 1991), it was predicted that the singing of paired-associate foreign language phrases would provide a memory advantage on Hungarian language tests, and that the greatest performance differences would be on verbatim, spoken language tests. Furthermore, we predicted that performance scores in the rhythmic speaking condition would fall between those in the singing and speaking conditions, since the singing condition would have the benefits of both rhythmic and melodic features to support encoding and retrieval.

There are five key differences between the present study and the Racette and Peretz (2007) experiment in which participants repeated lines of songs aloud (reciting or singing in their native language). First, the present study included a longer, 15-min learning period, with three repetitions of the entire stimulus set. Second, in the present study we used stimuli with matched rates of speaking and singing. Third, we used a variety of performance measures: verbatim recall and spoken production tasks in the new language, as well as English recall, foreign language recognition, and multiple-choice vocabulary tasks. Fourth, the verbal materials to be learned consisted of short phrases of only a few syllables rather than whole verses of song lyrics. Fifth, the stimuli were presented in an unfamiliar language and were paired with the translated phrase in the participants’ native language. If the advantages shown previously for musical presentations in native language studies were not observed in the present study, when the foreign language phrases were carefully controlled for duration and rate of presentation, this would support the claim of Kilgour, Jakobson, and Cuddy (2000) that previous experiments showing a benefit for music may have been flawed due to the use of learning materials that were not adequately controlled for duration and presentation rate. If a benefit for the singing condition were to be found in all tests, this would provide evidence that singing can support a variety of foreign language skills. If a benefit for the singing condition were observed in this study for the verbatim spoken recall tests, but not for all five Hungarian tests, this would lend support to the idea that listening to songs and singing can support verbal learning and memory, but that a significant advantage for a sung presentation may only be observed when verbatim recall measures are used.



Participants were randomly assigned to one of three learning conditions: speaking, rhythmic speaking, and singing. The participants heard 20 paired-associate phrases in English and an unfamiliar language (Hungarian) during a 15-min “listen-and-repeat” learning period, separated into three 5-min learning sessions. Participants practiced the 20 English–Hungarian paired-associate phrases one after another by first listening to the spoken English phrase, and then listening twice to the paired Hungarian phrase and repeating the Hungarian phrase aloud as best they could. The 15-min learning period was followed by a series of five different production, recall, recognition, and vocabulary tests for the English–Hungarian pairs. Measures of participants’ mood, background experience, and abilities in music and language were also administered in order to check that the randomly assigned groups were matched for these factors.

Hungarian was chosen because it was likely to be an unfamiliar language for native English-speaking participants. In addition, as compared to English (and, indeed, to the more frequently studied Germanic or Romance languages), Hungarian has different syntactic structures, few lexical cognates, and differences in the sound system. Using basic phrases in a foreign language, rather than using nonsense words that sound like possible native-language words, provided a strong test of whether singing can support foreign language learning. Importantly, the stimuli in the three learning conditions were also controlled for overall duration and presentation rate, reproducing an important feature of the experiments by Kilgour, Jakobson, and Cuddy (2000).


A group of 60 self-selecting adult students (30 male, 30 female) participated in the study. These participants were recruited through a university website advertising an auditory memory study to learn foreign language phrases. Their ages ranged from 18 to 29 years, with a mean age of 21.7 years. The 60 participants were randomly assigned to the three learning conditions, which were matched for gender (ten males and ten females in each group). Analyses of variance (ANOVAs) revealed no significant differences between the three groups in age, mood, phonological working memory, language learning experience, language learning aptitude, musical experience, or musical ability (see Table 1).
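The gender-matched random assignment described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not report how the randomization was carried out, and the function name, participant IDs, and seed are hypothetical.

```python
import random

CONDITIONS = ["speaking", "rhythmic speaking", "singing"]

def assign_stratified(participants, seed=0):
    """Randomly assign participants to conditions, balanced within each gender.

    `participants` is a list of (participant_id, gender) tuples.
    """
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    assignment = {}
    for gender in sorted({g for _, g in participants}):
        ids = [pid for pid, g in participants if g == gender]
        rng.shuffle(ids)
        # Deal the shuffled IDs round-robin so each condition gets an equal share.
        for i, pid in enumerate(ids):
            assignment[pid] = CONDITIONS[i % len(CONDITIONS)]
    return assignment

# Hypothetical IDs for 30 male and 30 female participants, as in the study.
participants = [(f"M{i:02d}", "male") for i in range(30)]
participants += [(f"F{i:02d}", "female") for i in range(30)]
assignment = assign_stratified(participants)
```

With 30 participants per gender and three conditions, the round-robin deal yields ten males and ten females in each group, matching the design reported above.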

Table 1 Descriptive statistics and ANOVA on the other measures in the speaking, rhythmic speaking, and singing groups


The English and Hungarian stimuli were recorded by native speakers of each language in a soundproofed recording studio. Both the English and Hungarian phrase recordings were made by an experienced sound engineer using an omnidirectional microphone. Digital audio files were recorded onto a Windows computer using the SONAR 4 Studio Edition software.

The English stimuli were recorded by a native British English speaker who was given a list of 20 phrases, plus three practice phrases and was asked to say them aloud at a normal speed. She then repeated the entire list to ensure that each English phrase had a clearly articulated recording. The recorded stimuli were later split into individual sound files, one for each English phrase, and these spoken English stimuli were used during the learning process in all three conditions and as the English prompts for the Hungarian production test.

The Hungarian stimuli were recorded by a native speaker who did not have extensive training in music or singing, but who felt comfortable singing for the recording sessions. The spoken phrases were recorded first; then the speaker said the rhythmically spoken phrases in time with a metronomic pulse (72 beats per minute [bpm]) using the rhythms of the written melodies; and finally, the stimuli were sung in time with the metronome, using a pitch range from A3 to F4. Musical notation of the sung stimuli is available in the Appendix, showing the rhythmic and melodic patterns employed. The same 20 Hungarian phrases were used in all three learning conditions and ranged from two to eight syllables in length. For the spoken stimuli, the Hungarian speaker was asked to speak slowly and clearly, as if she were talking to a nonnative speaker. Since the phrases were spoken very slowly, both the stressed and nonstressed syllables were pronounced clearly. The rhythms and melodies created for use in the rhythmic speaking and singing conditions were modeled on the natural prosody of the Hungarian language and on melodies found in Hungarian folk songs. The rhythms used for the rhythmic speaking and singing conditions were identical, in that the singing condition stimuli simply included the addition of a melodic line along with the rhythmic patterns used for the rhythmic speaking condition, and both were recorded at the strict tempo of 72 bpm. The final rhythmic speaking stimuli were thus very different from the speaking stimuli, since they had a clear, metrical, musical rhythm and were spoken in time with a metronome. The initial practice trials of participants in all three conditions were listened to by the experimenter, to confirm that participants were capable of repeating back the Hungarian phrases during the learning process at a reasonable level of accuracy.

An important consideration for this study was to control for the duration and rate of presentation of the foreign language phrases in the three learning conditions, since it has previously been argued that listening to a song is only facilitative when verbal materials are presented at a slower rate than normal speech (Kilgour et al., 2000). In this study, the duration of the Hungarian phrases was carefully controlled, with the shortest, two-syllable phrases lasting 1 s each, and the longest, eight-syllable phrases lasting for 4 s. The mean and range of durations for the Hungarian stimuli were almost identical across the three learning conditions (see Table 2). An ANOVA comparing the stimulus durations (in milliseconds) across the three learning conditions showed no significant difference in phrase durations (p = .97). The English spoken stimuli were identical in all three learning conditions, with a mean duration of 1.0 s. The phrases were also presented in the same context in all three learning conditions: English Phrase 1, pause (1 s), Hungarian Phrase 1, pause (1 s), Hungarian Phrase 1, pause (8 s) for a participant to repeat the Hungarian phrase as best he or she could, followed by English Phrase 2, and so on, up to Phrase 20 (see Fig. 1 for an illustration of the learning procedure).

Table 2 Mean durations (in milliseconds) of the Hungarian stimuli in the speaking, rhythmic speaking, and singing conditions
Fig. 1 Illustration of the English–Hungarian paired-associate phrase-learning procedure
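The timing of a single learning trial follows directly from the presentation sequence given above and can be worked through in a few lines. The pause lengths and the 1-s and 4-s phrase durations come from the text; the 2.0-s mean Hungarian phrase duration used for the session total is a hypothetical value chosen for illustration, since the Table 2 values are not reproduced here. The function names are likewise illustrative.

```python
def trial_duration(english_s, hungarian_s, short_pause_s=1.0, repeat_pause_s=8.0):
    """One trial: English, pause, Hungarian, pause, Hungarian, pause for repetition."""
    return (english_s + short_pause_s
            + hungarian_s + short_pause_s
            + hungarian_s + repeat_pause_s)

def session_duration(english_s, hungarian_durations_s):
    """Total length of one pass through the phrase list, in seconds."""
    return sum(trial_duration(english_s, h) for h in hungarian_durations_s)

# Stated in the text: English prompts average 1.0 s; Hungarian phrases last 1 s to 4 s.
shortest = trial_duration(1.0, 1.0)  # 13.0 s
longest = trial_duration(1.0, 4.0)   # 19.0 s

# ASSUMPTION: a 2.0 s mean Hungarian phrase duration (hypothetical, not from Table 2).
session = session_duration(1.0, [2.0] * 20)  # 300.0 s
```

Under that assumed mean, 20 trials would take 300 s, which is consistent with the 5-min length reported for each learning session; this is an arithmetic check only, not a reported result.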

The order of stimulus presentation in both the learning and testing phases was generated using a pseudorandom number generator based on the Mersenne Twister algorithm (Matsumoto & Nishimura, 1998). The order of presentation was then checked to ensure that a phrase with a particular word was not placed directly before or after another phrase with the same word. The final Hungarian tests were also checked to ensure that the algorithm had not placed the test phrases in the same order of presentation as those phrases had appeared in during the learning sessions.
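The ordering checks described above can be sketched with Python's standard `random` module, whose core generator is the Mersenne Twister. This is a minimal sketch rather than the authors' procedure: the source says only that the orders were checked, so the automatic reshuffling loop, the seeds, and the function names are assumptions, and the five phrases are simply those quoted elsewhere in this article.

```python
import random

PHRASES = ["Nem tudom", "Nem értem", "Igen, köszönöm",
           "Marja vagyok", "Megismételné, kérem"]

def words(phrase):
    """Lower-cased words of a phrase, with surrounding punctuation removed."""
    return {w.strip(",.?!").lower() for w in phrase.split()}

def valid_order(order, forbidden_order=None):
    """No two neighbouring phrases share a word, and the order is not the forbidden one."""
    no_shared_neighbours = all(not (words(a) & words(b))
                               for a, b in zip(order, order[1:]))
    return no_shared_neighbours and order != forbidden_order

def draw_order(phrases, seed, forbidden_order=None, max_tries=10_000):
    """Reshuffle until the order passes both checks (assumed rejection strategy)."""
    rng = random.Random(seed)  # Mersenne Twister generator; seed is hypothetical
    order = list(phrases)
    for _ in range(max_tries):
        rng.shuffle(order)
        if valid_order(order, forbidden_order):
            return list(order)
    raise RuntimeError("no valid order found within max_tries")

learning_order = draw_order(PHRASES, seed=1)
test_order = draw_order(PHRASES, seed=2, forbidden_order=learning_order)
```

Here "Nem tudom" and "Nem értem" share the word "Nem", so any order that places them side by side is rejected, and the test order is additionally required to differ from the learning order.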


Five Hungarian tests were developed to measure participants’ learning of the paired-associate English–Hungarian phrases, and several background measures and questionnaires were also administered to establish whether the three groups were well matched, as described below.

Multiple-choice Hungarian vocabulary test

This test consisted of 20 forced-choice, multiple-choice questions in which each Hungarian word was presented with four possible English meanings to choose from (chance performance was thus 25%). This measure was used as a pretest in order to assess whether participants had any prior knowledge of individual words in Hungarian, and again as a posttest in order to test whether the same individual words could be correctly identified after the learning sessions (which involved learning complete phrases in Hungarian). A score higher than 50% on the pretest resulted in the participant’s data being excluded, due to the possibility that the participant had basic knowledge of Hungarian prior to starting the study (four participants were removed for this reason). The same 20 multiple-choice items were used as the Hungarian vocabulary posttest after participants had finished all three learning sessions.
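The chance level and the pretest exclusion rule can be made concrete with a short binomial calculation. The sketch below assumes that a participant with no knowledge of Hungarian guesses independently on each item with a success probability of 1/4; the function names are illustrative and are not taken from the study.

```python
from math import comb

N_ITEMS = 20
P_GUESS = 1 / 4  # four response options per item

def prob_at_least(k, n=N_ITEMS, p=P_GUESS):
    """Binomial probability of k or more correct answers out of n by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def exclude_on_pretest(n_correct, n_items=N_ITEMS):
    """Stated rule: a pretest score higher than 50% leads to exclusion."""
    return n_correct / n_items > 0.5

expected_by_chance = N_ITEMS * P_GUESS      # 5.0 items correct on average
p_excluded_by_guessing = prob_at_least(11)  # probability of 11 or more correct by guessing
```

Under this guessing model, a score above the cutoff requires at least 11 correct answers, which pure guessing produces in well under 1% of cases.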

Hungarian production test

The participants heard the 20 English phrases from the learning sessions—presented in a different, randomized order—and attempted to recall and reproduce the equivalent Hungarian phrases as best they could. The written, on-screen instructions asked participants to say the Hungarian phrases normally (rather than using singing or rhythmic speaking). That is, although the participants in two conditions spoke rhythmically or sang during the learning phase, all participants were explicitly asked to speak normally during the test phase, and they all complied with this instruction.

English recall test

Participants heard the 20 Hungarian phrases as prompts—presented in a different, randomized order—and attempted to recall and reproduce the equivalent English phrase. For this test, participants heard the Hungarian stimuli in the same way as during the learning sessions (e.g., spoken, rhythmically spoken, or sung phrases, depending on the group to which a participant had been assigned).

Hungarian recognition test

The participants were asked to make same/different judgments for accurate and inaccurate spoken versions of the 20 Hungarian phrases they had learned—again presented in a different, randomized order. Ten of the Hungarian phrases were presented with all syllables in the correct order. In the remaining ten items, two adjacent syllables within each phrase were swapped, resulting in new, incorrect Hungarian phrases (e.g., Megismételné, kérem was changed to Megistelméné, kérem). A native English speaker created the ten new, “inaccurate” Hungarian phrases, and then the same Hungarian speaker who was recorded for the other Hungarian stimuli was audio-recorded saying these ten “inaccurate” phrases. Because the ten inaccurate Hungarian phrases still had all of the same syllables, they sounded very similar (but not identical) to the phrases that participants had heard during the learning sessions.
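The construction of an “inaccurate” item by swapping two adjacent syllables can be illustrated as follows. The syllable segmentation is supplied by hand and is an assumption, chosen so that the sketch reproduces the example given above; the function name is hypothetical.

```python
def swap_adjacent(syllables, i):
    """Return a copy of `syllables` with positions i and i + 1 exchanged."""
    if not 0 <= i < len(syllables) - 1:
        raise IndexError("i must index a syllable that has a right-hand neighbour")
    swapped = list(syllables)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped

# ASSUMPTION: hand segmentation of the first word of "Megismételné, kérem".
syllables = ["Meg", "is", "mé", "tel", "né"]
foil = "".join(swap_adjacent(syllables, 2)) + ", kérem"
```

Swapping the third and fourth syllables gives "Megistelméné, kérem", which contains exactly the same syllables as the original phrase in a different order.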

Delayed-recall Hungarian conversation

Participants were asked to engage in a short conversation entirely in Hungarian, 20 min after the final learning session had been completed. They were informed that they would hear a series of Hungarian phrases on an audio recording and were instructed to respond by using a Hungarian phrase that would make sense in the context. The participants were encouraged to guess or to attempt to recall and reproduce the Hungarian phrases for “I don’t know” or “I don’t understand” if they were unsure of how else to respond. The recording consisted of five simple Hungarian phrases, separated by 8-s pauses, which functioned as one side of a brief conversation.

Other measures

The participants also completed a number of additional measures and questionnaires relating to their musical and language learning abilities and experience. They reported their age and gender at the start of the experiment session. This was followed by an assessment of each participant’s phonological working memory, using the 20 low-wordlike items from the Children’s Test of Nonword Repetition (CNRep) developed by Archibald and Gathercole (2006, p. 514). Each participant also completed the 20-item self-report Positive and Negative Affect Scale (PANAS) mood questionnaire (Watson, Clark, & Tellegen, 1988) at the beginning and end of the experimental session. A brief language aptitude test (adapted from the modern language aptitude test of Gilleece, 2006) and a short questionnaire about the participant’s language-learning experience were administered, in addition to a brief musical ability test (adapted from the musical ability tests developed by Overy, Nicolson, Fawcett, & Clarke, 2003) and a short questionnaire about their musical training and experience.


The experimental sessions were held in a quiet room at a comfortable temperature and with appropriate lighting. All of the participants completed an informed consent form and were treated according to the ethical research standards published by the American Psychological Association (2002). Sessions were completed on an individual basis, with each participant taking approximately 60 min to complete all sections of the experiment. The participants were compensated £6 for their time.

Participants first completed the phonological working memory test (CNRep) by repeating each nonword after the researcher. This test was followed by the brief, presession mood questionnaire (PANAS) and the multiple-choice Hungarian vocabulary pretest, both of which were presented on a Windows desktop computer. Because the Firefox Web browser was displayed full-screen without displaying the URL, participants could neither return to a previous screen nor proceed to the next page until all of the required responses for each webpage were completed.

Before beginning the Hungarian learning sessions, participants were given spoken and written instructions that they should listen to the recording and repeat the phrases that they heard in the new language aloud, as best they could, and try to remember both the foreign phrases and the English meanings. The auditory stimuli and test items were played at a comfortable volume through noise-canceling headphones. Participants completed a practice session with three Hungarian phrases (which were never used again) while the researcher was present to answer questions. After establishing that the participant understood and was accurately repeating the practice phrases as instructed, the researcher went to a nearby room while the participant worked through the remainder of the session by following written on-screen instructions.

The 15-min learning period consisted of three 5-min aural/oral “listen-and-repeat” learning sessions. During the first learning session, the Hungarian phrases were displayed as text on the screen, as the 20 paired-associate phrases were presented. No text was displayed for the second and third 5-min learning sessions. This learning procedure gave participants time to learn and practice repeating the 20 Hungarian phrases in the complete paired-associate list three times, before performance on the material was evaluated.

At the end of the three 5-min learning sessions, participants first completed the Hungarian production test, followed by the English recall test, the Hungarian recognition test, and finally the multiple-choice Hungarian vocabulary posttest. For all tests, the participants were told that if they were not certain of the correct response, they should try to guess. They then completed the measures of language learning ability and experience and of musical ability and experience, as well as the brief mood postsession questionnaire (PANAS). Finally, participants completed the delayed-recall Hungarian conversation test.

At the end of the experimental session, the participants completed a four-item debriefing questionnaire about the study. They were also informed that Hungarian was the language that they had been learning during the experiment.

Data analysis

Digital audio recordings were made during each experimental session. Listening to the recordings confirmed that all of the participants had followed the instructions during the Hungarian learning sessions. Responses to the oral test items were analyzed by phonetically transcribing participants’ spoken utterances from the audio recordings, which were later analyzed by the experimenters without knowledge of the learning condition to which each participant had been assigned. These raw data were entered into a spreadsheet, and scores were calculated on the basis of the phonetic transcriptions. Responses to the Web-based items were collected separately via a MySQL database to reduce the need for paper tests that could introduce coding errors.

Multiple-choice Hungarian vocabulary test

Participants’ responses to these Web-based test items were scored with a correct answer receiving one point and an incorrect answer receiving zero points. A total score of 20 points was possible. Across all groups, the mean posttest performance was significantly above chance levels (p < .001), indicating that L2 learning had occurred (see the Results section).

Hungarian production test

All of the participants’ spoken utterances on this verbatim recall task were phonetically transcribed from the audio recordings. A point was only awarded if the participant produced the whole phrase in the new language correctly, with all syllables in the correct order. However, perfect pronunciation was not required; for example, the Hungarian phrase meaning “I don’t understand” ['nɛm 'ertɛm] was scored as being correct if the participant said ['nɛm 'erdɛm], and the phrase meaning “Yes, thank you” ['igɛn 'køsønøm] was scored as being correct if the participant said ['Igɛn 'køsønøm]. A total score of 20 points was possible.

English recall test

The participants’ English phrases spoken in response to the Hungarian prompts were transcribed from the audio recordings. One point was awarded if the participant produced the correct meaning of the phrase in English, for a total possible score of 20. Verbatim production of the original English phrase was not required; for example, the response “My name is Maria” was accepted as being correct for the Hungarian phrase Marja vagyok (“I am Maria”).

Hungarian recognition test

The same/different accuracy judgments for the spoken versions of the 20 Hungarian phrases were scored as being either correct (one point) or incorrect (zero points), for a total of 20 possible points. The absence of a response was also scored as zero.

Delayed-recall Hungarian conversation

Participants’ responses on this five-item test were phonetically transcribed from the audio recordings and scored out of a possible 10 points. Two points were awarded for an appropriate reply, spoken in Hungarian, to the previous statement. Responses of Nem tudom (“I don’t know”) or Nem értem (“I don’t understand”) received just one point, whereas incorrect Hungarian phrases and replies in English earned zero points.
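
The scoring rule for the delayed-recall conversation can be sketched in a few lines of Python. This is an illustration only: the category labels and the function name are hypothetical, and assigning a category to each transcribed reply remains a human coding step that the code does not attempt.

```python
# Hypothetical category labels; a human coder assigns one to each reply.
POINTS = {
    "appropriate_hungarian": 2,    # appropriate reply spoken in Hungarian
    "dont_know_or_understand": 1,  # "Nem tudom" or "Nem értem"
    "incorrect_hungarian": 0,      # incorrect Hungarian phrase
    "english_reply": 0,            # reply given in English
}


def score_conversation(coded_replies):
    """Sum the points for the five coded replies (maximum 10 points)."""
    if len(coded_replies) != 5:
        raise ValueError("the conversation test has five items")
    return sum(POINTS[category] for category in coded_replies)


# Made-up example: 2 + 1 + 0 + 2 + 0 points.
example = [
    "appropriate_hungarian",
    "dont_know_or_understand",
    "english_reply",
    "appropriate_hungarian",
    "incorrect_hungarian",
]
print(score_conversation(example))  # 5
```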


Results

Scores showed no ceiling or floor effects on any of the five Hungarian tests in any of the groups. For four out of the five tests, the mean score was highest in the singing group, whereas for the Hungarian recognition test, the mean scores were highest and equal in the singing and speaking groups. Across all groups, some individuals received zero points for the recall tests that required participants to speak in Hungarian (10 participants scored zero points on the Hungarian production test, and 20 participants scored zero points for the delayed-recall Hungarian conversation). Table 3 shows descriptive statistics for the learners’ performance, and Fig. 2 shows box-and-whisker plots for each of the five Hungarian tests.

Table 3 Raw Hungarian test scores for the five Hungarian tests in the speaking, rhythmic speaking, and singing conditions
Fig. 2 Box-and-whisker plots for the five Hungarian tests

Two of the participants were high-performing outliers on the Hungarian production test and the English recall test, both female and both in the speaking condition, with both women’s scores falling more than 1.5 standard deviations above the group mean. The two outliers were nevertheless included in the analyses, to keep the numbers of participants in the different groups equal. Levene’s test of the homogeneity of variances was then conducted in order to investigate whether the groups were similar to one another in terms of the dispersion of the Hungarian test scores, with the results showing no significant differences in dispersion for any of the Hungarian tests (see Table 4).
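
For readers unfamiliar with Levene’s test, the statistic itself can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: it uses absolute deviations from each group mean (the text does not say whether the mean or the median variant was applied), the function name is hypothetical, and the example data are made up. The resulting W is compared against an F distribution with k − 1 and N − k degrees of freedom; the p value is omitted here because the Python standard library has no F distribution.

```python
from statistics import fmean


def levene_w(groups):
    """Levene's W statistic, using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |Y_ij - mean of group i|
    z = [[abs(y - fmean(g)) for y in g] for g in groups]
    z_means = [fmean(zi) for zi in z]
    z_grand = fmean([v for zi in z for v in zi])
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    # Undefined (division by zero) if every deviation equals its group mean deviation.
    return (n_total - k) / (k - 1) * between / within


# Made-up scores: identical spread in every group gives W = 0.
print(levene_w([[1, 2, 3], [11, 12, 13], [21, 22, 23]]))
```

A larger W indicates that the groups differ more in dispersion; a nonsignificant result, as reported here, is consistent with similar dispersion across groups.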

Table 4 Levene’s test results for the homogeneity of scores for the five Hungarian tests

Scores on the Hungarian tests did not always show a normal distribution, with only the data from the singing condition passing Shapiro–Wilk’s test of normality for all five Hungarian tests. For participants in the rhythmic speaking condition, the spoken Hungarian tests did not pass Shapiro–Wilk’s test of normality (W = 0.89, p < .05, on the Hungarian production test, and W = 0.84, p < .01, on the delayed-recall Hungarian conversation). In the speaking condition, the same two spoken Hungarian tests deviated significantly from a normal distribution at the p < .01 level, whereas the English recall test and the Hungarian recognition test deviated at the p < .05 level (see Footnote 2). However, since skewness and kurtosis values less than 2 still fall within the normal range (Field, 2000), all of the Hungarian tests were close enough to a normal distribution to permit additional statistical analyses. In addition, both ANOVA and multivariate analysis of variance (MANOVA) are robust against violations of normality and heterogeneity of variances when the sample sizes are equal (as was the case in this study), so all five tests were used to compare participants’ Hungarian performance scores in the three groups.
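
The skewness and kurtosis rule of thumb mentioned above can be illustrated with a short sketch. It assumes simple moment-based estimators (statistical packages often report bias-adjusted versions, so their values may differ slightly), treats kurtosis as excess kurtosis, and uses hypothetical function names and made-up data.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(scores):
    """Moment-based skewness and excess kurtosis (no bias adjustment)."""
    n = len(scores)
    mean = fmean(scores)
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


def within_rule_of_thumb(scores, limit=2.0):
    """True if both |skewness| and |excess kurtosis| fall below the limit."""
    skew, kurt = skewness_and_excess_kurtosis(scores)
    return abs(skew) < limit and abs(kurt) < limit


# Made-up, symmetric scores: skewness is zero and kurtosis is mildly negative.
print(within_rule_of_thumb([1, 2, 3, 4, 5]))  # True
```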

A MANOVA (see Footnote 3) comparing the participants’ scores on each of the five Hungarian tests showed a strong trend toward a difference between the groups overall, F(2, 57) = 1.80, p = .07, partial η² = .15. As we hypothesized on the basis of previous findings showing a benefit for a sung presentation of material for verbatim verbal recall tasks (Purnell-Webb & Speelman, 2008; Rainey & Larsen, 2002; Wallace, 1994; Yalch, 1991), the MANOVA for the two spoken Hungarian tests (Hungarian production test and delayed-recall Hungarian conversation) showed a main effect of learning condition, F(2, 57) = 2.801, p < .05, partial η² = .09, with the singing condition showing the highest performance. By contrast, the MANOVA comparing participants’ scores on the three other tests (multiple-choice vocabulary, English recall, and Hungarian recognition) did not show a difference on the basis of learning condition, F(2, 57) = 0.846, p = .54, partial η² = .04. Separate ANOVAs revealed significant between-group differences on the spoken Hungarian tests at the p < .05 level (see Table 5 for the ANOVA results on the five Hungarian tests).
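
The separate univariate ANOVAs follow the standard one-way between-groups computation, which can be sketched as below. This is not the study’s analysis code and it does not reproduce the MANOVAs; the function name is hypothetical and the data are made up. In a one-way design, partial η² equals the ratio of the between-groups sum of squares to the total sum of squares.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a one-way design."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([y for g in groups for y in g])
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((y - fmean(g)) ** 2 for g in groups for y in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


# Made-up scores for three small groups.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

With three equal groups of 20 participants, the same computation would yield 2 and 57 degrees of freedom, which matches the degrees of freedom reported above.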

Table 5 ANOVA for Hungarian tests across the speaking, rhythmic speaking, and singing conditions

Post-hoc analyses comparing scores on the five individual Hungarian tests revealed that participants in the singing condition showed significantly higher performance on the Hungarian production test, relative to those in the speaking condition, t(38) = 2.38, p < .05, Cohen’s d = 0.75. Participants in the singing condition also showed significantly higher performance on the delayed-recall Hungarian conversation, relative to those in the rhythmic speaking condition, t(38) = 3.01, p < .01, Cohen’s d = 0.95. No significant group differences were observed for the English recall test, the Hungarian recognition test, or the multiple-choice Hungarian vocabulary test.
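
The reported effect sizes can be checked against the t statistics with the standard conversion for independent samples, d = t × √(1/n₁ + 1/n₂). The sketch below assumes 20 participants per group, which is inferred from the 38 degrees of freedom and the equal group sizes rather than stated in this passage; the function name is hypothetical.

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d (pooled SD)."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Assumed group size of 20 each, giving df = 20 + 20 - 2 = 38.
print(round(cohens_d_from_t(2.38, 20, 20), 2))  # 0.75
print(round(cohens_d_from_t(3.01, 20, 20), 2))  # 0.95
```

Both values agree with the Cohen’s d figures reported for the production test and the delayed-recall conversation.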


Discussion

Our main finding was that singing was more effective as a learning condition than either speaking or rhythmic speaking when participants were required to recall and reproduce a list of short paired-associate foreign language phrases. Under controlled experimental conditions, participants in the singing condition outperformed participants in the speaking and rhythmic speaking conditions in four out of five tests. MANOVAs revealed a significant between-group performance difference on the two spoken, verbatim recall Hungarian tests (p < .05), whereas the differences in performance on the other tests were not statistically significant.

The present results provide the first experimental evidence that singing can support L2 learning, and they support the hypothesis that the benefits of a sung presentation of verbal material in verbal learning are most evident on verbatim recall tasks. The benefits of singing found here cannot be explained by a difference in the rate of presentation of the stimuli or in their overall durations (as proposed by Kilgour et al., 2000), since these factors were carefully controlled across all three learning conditions. In addition, we observed no significant differences in age, mood, phonological working memory, or music and language experience and ability between the three groups of participants.

These findings represent an important contribution to the literature, not least because they complement and extend the findings of previous studies that have used native languages to investigate how verbal learning and memory can be supported by melody, music listening, or singing (Gfeller, 1983; Rainey & Larsen, 2002; Thaut et al., 2008; Wallace, 1994). Here we showed that a listen-and-repeat singing method using simple, previously unfamiliar melodies can provide a significant memory benefit for paired-associate foreign language learning, both immediately and after a 20-min delay.

In contrast to previous work indicating that rhythm was the most supportive element of a musical presentation method for linguistic skills and memory in the native language (Purnell-Webb & Speelman, 2008; Stahl, Kotz, Henseler, Turner, & Geyer, 2011), in the present study participants in the rhythmic speaking condition did not perform at a level similar to the one attained by those in the singing condition, or indeed perform very differently from those in the speaking condition. One possible explanation for this is the fact that a proportion of our stimuli consisted of short phrases of only two or three syllables, which provides a limited temporal structure in which to establish a sense of rhythmic pulse. However, since our interest was in memory for short phrases, the stimuli were nevertheless valid, and in the present study at least, the pitched, melodic nature of the stimuli was what seemed to drive the significant effects in the singing condition.

The success of this listen-and-repeat singing paradigm thus leads to a consideration of the possible contribution of pitch structure to verbal learning and memory. It has previously been proposed that pitch information provides an extra, musical cue (in addition to, and different from, a prosodic cue), which can support retrieval and recall (Peretz, Radeau, & Arguin, 2004; Serafine, Crowder, & Repp, 1984; Yalch, 1991). Prior work has also demonstrated the relative value of melodic pitch structure over and above rhythmic structure on musical recognition tasks, suggesting that melodic structures may have a stronger encoding distinctiveness than rhythmic structures (Hébert & Peretz, 1997). Schön and colleagues have shown that a consistent mapping of linguistic and pitched information can enhance nonsense-word learning, and have suggested that when a syllable change is accompanied by a change of pitch, this has the potential to enhance phonological discrimination (Schön et al., 2008). Since one of the first challenges in learning a new language is to segment speech sounds into individual words (Jusczyk & Aslin, 1995), a consistent melodic structure may thus be helpful in the earliest stages of both first and subsequent language learning—for example, within the simple, repetitive structure of lullabies (Schön et al., 2008). Thiessen and Saffran (2009) have suggested that “bidirectional” facilitation for lyrics and melodies may occur during infant-directed lullabies, with redundant cues potentially identifying structure in the complex input.

Yalch (1991) previously suggested that musical cues might be especially helpful as a mnemonic aid when other cues are not available, such that when a task is relatively easy (e.g., when visual or other cues are available, only recognition is required, or more repetitions are presented), the benefit of melodic cues and structure is not required. This idea corresponds with our finding that singing helped significantly with the more demanding Hungarian speaking tasks (p < .05), but not with the English recall, Hungarian recognition, or multiple-choice vocabulary tasks (n.s.). Yalch also suggested that the sung phrases found in advertising jingles might emphasize the phonetic aspects of verbal information more than the semantic aspects, thus leading to more effective verbatim recall, which corresponds with the significant results of our study on the Hungarian tests that required verbatim performance. It could be argued that an emphasis on the phonetic aspects of verbal stimuli may be particularly useful when beginning to learn a foreign language, since the semantic meaning of individual words is not always directly available on which to “hang” the utterance. As we mentioned in the introduction, previous research has also shown that recognition for surface information (or “verbatim recognition”) decays over time for prose, but improves for both music (Dowling et al., 2002) and poetry (Tillman & Dowling, 2007).

Further evidence from cognitive neuroscience suggests that the integration of lyrics and melodies occurs at an early stage of neural processing and that, even when attention is paid to the words, the melody still has an influence on input processing (Gordon et al., 2010; Sammler et al., 2010). The design and results of the present study do not allow us to identify whether the beneficial effects of singing in this listen-and-repeat, paired-associate foreign language learning paradigm were due to correlated pitch cues, integrated encoding of lyrics and melodies, or other possible factors (such as increased attention), but it seems likely that a variety of factors may have come into play, which will need to be delineated in further research.

One important aspect of this study was the fact that the learning paradigm and tests assessed both participants’ L2 understanding of and verbatim use of the new linguistic material, rather than employing a nonsense-word paradigm. It is also important to note that learning via singing showed a direct transfer to speaking skills, since all participants were tested on their spoken Hungarian skills, regardless of the learning condition. Finally, the Hungarian language provided a robust test environment for our hypothesis: This language was unfamiliar to all of the participants before the experiment and has a different sound system and syntactic structures than English or the Germanic and Romance language families, as well as very few lexical cognates with those languages.

We found some evidence that semantic learning took place in this experiment, since the Hungarian conversation task involved selecting the most appropriate L2 phrase to use in the conversational context, and the Hungarian production task involved matching an English phrase, which would have been understood semantically, with the correct L2 phrase. In addition, the multiple-choice vocabulary test involved isolating the meaning of individual Hungarian words (most of which needed to be extracted from the entire phrases used in the learning sessions), and the English recall test also involved matching the correct English phrase with L2 phrases, indicating the strong likelihood of some semantic understanding. However, further work will be needed in order to investigate the extent of semantic learning that is possible via a sung presentation; this was not the aim of the present experiment.

In summary, the present results suggest that the benefits of singing in foreign language learning are greatest for spoken recall in the new language, including after a delay of 20 min. We believe that these results are relevant for educational practice, not least because, despite educators reporting for many years that music can support word learning in the foreign language classroom (e.g., Medina, 1993), this is the first study to show a benefit for a musical presentation method using a randomized, controlled, experimental design. Such empirical evidence also opens up interesting research avenues for further experimental work, such as exploring the potential links between prosody and melody; examining the stages of encoding, rehearsal, and retrieval in more detail; exploring the neural mechanisms in more depth; and investigating the benefits of singing in a foreign language for classroom learning and for educational practice at a range of age and skill levels.