Introduction

Pronunciation is a much more important and pervasive feature of communication than is generally recognized. It is the crucial starting point for all spoken language, since thoughts must be articulated in sound in order to be heard and so to become a message that can be communicated to another person. Pronunciation is required not merely for talking, but for communicating and making sense to another person, that is, for making meaning in both an audible and an understandable form. A person’s pronunciation ensures the clarity required for a listener to be able to pick out words from the stream of speech and put them together in meaningful, comprehensible patterns, and also projects information about the speaker and the context of communication that makes a certain impression and establishes the common ground between speaker and listener that is needed for effective communication. In both of these aspects, pronunciation is the foundation of messaging in speech—through articulating words and their combinations in grammatical and discourse units and through projecting multiple facets of social and contextual meaning.

Research into pronunciation in real-world contexts, which today incorporate people’s transglobal movements and interactions, is making its centrality and multiple functions in communication increasingly clear. A growing body of research demonstrates that pronunciation is an aspect of language and communication which demands attention in educational and workplace contexts where speakers who have different mother tongues seek to communicate in a common language, which in the world today is often English. The emphasis of this book is on pronunciation practice and research focused on teaching, learning, and using English in these real-world contexts of transglobal and international communication.

In this chapter, we take an in-depth look at the nature of pronunciation as a component of language and communication, in its many aspects as both production and perception of speech, and in its many functions for conveying meaning of different types. We begin by differentiating the terms and disciplines that are associated with the study of speech sounds, in order to make clear to readers our own references to pronunciation in this book. Next, we review the features of pronunciation and the different types of linguistic and social meaning expressed, first by the pronunciation of individual sounds and then by the pronunciation of stretches of speech. In that part of the discussion, we give many examples of the kinds of meaning conveyed by pronunciation and how misunderstanding may result from unclear pronunciation or different conventions for pronunciation and the interpretation of speech in different speech communities. That review is followed by a consideration of pronunciation as a feature of group and individual identity. The chapter also provides a review of key concepts as they are used in the different areas of pronunciation research and practice included in this book. By reviewing the multifaceted nature of pronunciation as a pervasive dimension of communication and introducing key terms and concepts for talking about pronunciation in its many manifestations, this introductory chapter lays the foundation for the remainder of the book.

The Nature of Pronunciation and Why It Is Important

A First Look at Phonology and Pronunciation

In linguistics, phonology refers to the sound system of a language, that is, the distinctions in sounds that are meaningful for that language, or to the sound stratum or level of language, as distinct from the other “higher” strata (e.g., of lexis and syntax) of language. Phonology can be thought of as the surface level, or the building blocks, of a language. All of the spoken units of a language, from syllables up to whole discourses, are expressed through or composed of speech sounds: segmental features or phonemes (consonants and vowels) and suprasegmental features or prosodies (properties of stretches of speech). Phonology is therefore one of the aspects that can be described or analyzed about a language and its individual elements (words, phrases, clauses, sentences, and discourses such as conversations or speeches). It is also one of the aspects of speech that can be described or analyzed with respect to individual speakers or groups of speakers.

Phonology comprises the meaningful units of sound out of which all spoken language is formed and connected, by convention, to meanings that human beings recognize and respond to—both internally, in terms of their thoughts and feelings, and externally, in terms of their interactive moves. Phonology can therefore be viewed as having both psychological and social dimensions. Phonology also has a cognitive dimension, since the articulatory, auditory, psychological, and social patterning of spoken language is imprinted in specific neural pathways. The brain is then able to control and integrate all aspects of phonological performance, both subconsciously and consciously, to ensure that speech is produced with a high degree of understandability according to the speaker’s intention.

Pronunciation is a prominent term among a number of different terms used within the realm of phonology and the various types of research and practice connected to the sound stratum of language. Although phonology is sometimes used as a cover term for all of the phenomena related to linguistic sound, it is often restricted to the description or the study of meaningful distinctions in sound of a language, and on this basis differentiated from phonetics, which refers to the description or the study of the details of language sounds. Linguists regularly use these two terms with this contrast in mind, phonology to refer to the system and units of linguistic sound that are meaningful for a language and phonetics to refer to the physical properties of those units. The emphasis of theoretical linguists on theoretical phonology (or in some cases, theoretical phonetics) can be contrasted with the practical applications of applied linguists, which can be referred to as applied phonology (or in some cases, applied phonetics). The term pronunciation tends to have a practical or applied emphasis and so is generally not used by theoretical linguists and researchers in second language acquisition (SLA), who typically refer to phonology (or occasionally phonetics) as their area of study. Language teachers generally use the term pronunciation, referring to an area of proficiency in language learning or a type of skill in spoken language performance, rather than phonology.

Researchers and practitioners with a practical or applied emphasis may use any of these terms (phonology, pronunciation, or phonetics) together with others, such as articulation, relating to the mechanics of producing speech sounds (e.g., speech therapists), or accent, relating to the general characteristics of speech that are associated with a certain geographical locale or social group (e.g., managers and trainers in business). Social psychologists may refer to pronunciation or accent as a focus of investigation on people’s attitudes to specific languages or speaker groups. Because we aim to focus on the practical aspects of phonology, we will refer to pronunciation for the most part, while using the other terms as appropriate for our coverage of research and practice in the various disciplines and areas of spoken language performance included in this book.

As a type of linguistic skill or language proficiency, pronunciation involves learning to articulate and discriminate the individual sound elements or phonemes making up the system of consonants and vowels of a language, sometimes referred to as segmental phonology, and the features of connected speech making up its prosody or prosodic system, sometimes referred to as suprasegmental phonology. The prosodic system or suprasegmental phonology includes, at a minimum, tone and intonation (defined by pitch), rhythm (defined by duration), and stress or accentuation (defined by acoustic intensity, force of articulation, or perceptual prominence). From the perspective of language teaching, prosody may also include articulatory (or vocal) setting, a complex of specific postures of the vocal organs (lips, tongue, jaw, and vocal folds), and/or voice quality, the vocal characteristics resulting from such settings, that are associated with different languages and pragmatic meanings.

Phonemes are key to the makeup of words and their component parts, syllables: the allowable individual phonemes and phoneme combinations that can carry stress (e.g., /a/ alone but not /b/ alone; vowel [V] and consonant [C] in combination, /ba/ [C + V] and /ab/ [V + C]; and the vowel flanked by consonants, /bab/ [C + V + C] and /blabz/ [CC + V + CC]). Individual phonemes differentiate rhyming pairs (e.g., lap and cap, up and cup, seek and peak) as well as all kinds of minimal pairs, that is, pairs of words that differ in meaning based on a difference in one phoneme (e.g., cab and cap, cup and cap, clap and cap, pick and peek). Prosody comes into play when individual consonants and vowels are joined together to make syllables, as the components of the meaning-units (morphemes) composing words, which are the building blocks of phrases and all longer grammatical units and stretches of speech. Patterns of rhythm, stress/accentuation, tone, and intonation delimit the structure and meaning of words and larger units.
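The syllable shapes and minimal pairs described above can be illustrated with a short Python sketch. It is not part of the original discussion: the function names, the small vowel set, and the broad phonemic transcriptions are assumptions made purely for illustration, and actual transcriptions vary by accent.

```python
# Illustrative sketch only: the vowel set and transcriptions below are
# simplified assumptions, not a full phoneme inventory of English.
VOWELS = {"a", "æ", "ʌ", "ɪ", "iː"}


def cv_skeleton(phonemes):
    """Map a list of phoneme symbols to a C/V pattern, e.g. 'CCVCC'."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


def differs_by_one_phoneme(a, b):
    """True if two phoneme lists differ by exactly one substituted,
    added, or omitted phoneme (the sense of 'minimal pair' used above)."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Dropping one phoneme from the longer word must give the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


# Hypothetical broad transcriptions of words from the examples above.
WORDS = {
    "cab": ["k", "æ", "b"],
    "cap": ["k", "æ", "p"],
    "cup": ["k", "ʌ", "p"],
    "clap": ["k", "l", "æ", "p"],
    "pick": ["p", "ɪ", "k"],
    "peek": ["p", "iː", "k"],
}

print(cv_skeleton(["b", "l", "a", "b", "z"]))  # CCVCC
for w1, w2 in [("cab", "cap"), ("clap", "cap"), ("cab", "cup")]:
    print(w1, w2, differs_by_one_phoneme(WORDS[w1], WORDS[w2]))
```

Under these assumed transcriptions, cab/cap and clap/cap each differ by one phoneme, whereas cab/cup differ by two and so would not count as a minimal pair.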

Intonation is sometimes referred to as speech melody or, informally, the “tunes” of language. Traditionally, American linguistics has made a distinction between tone as referring to word-level pitch patterns and intonation as referring to sentence-level or utterance-level pitch patterns (and often incorporating stress patterns as well) that is not made in British linguistics, where tone is a component of intonation (e.g., Halliday & Greaves, 2008). In this book, we will sometimes use tone to refer to pitch patterns or contours that function above the word level, reflecting the British tradition followed in some studies. As in the case of other terms connected to pronunciation teaching and research, we seek to avoid terminological confusion and overload while also aiming to accurately represent the way that terms are currently being used.

The sound system of each language is unique, built on specific distinctions in phonemes and prosodic features. Languages differ in the size of their phoneme inventories as well as in the specific phonetic features that differentiate individual consonants, vowels, and prosodic patterns and their cue value, that is, the relative importance of specific phonetic features. While some languages have a small inventory of vowels (e.g., Hawaiian, Serbo-Croatian) or consonants (e.g., Cantonese, Japanese), others have a large inventory of vowels (e.g., Danish, English, Finnish) or consonants (e.g., Hindustani, Lithuanian). All languages have distinctive patterns of rhythm and intonation within their grammatical units, but languages differ in the prosodic basis of lexical (word-level) distinctions and patterning. While in some languages (so-called tone languages) tone (pitch levels or contours) is a defining feature of individual words and word combinations (e.g., Hausa, Thai), in others, stress is a defining feature at the word level (e.g., Arabic, English). The consonant and vowel phonemes and prosodic patterns of individual languages, the specific phonetic features of their phonemes and prosodies, and the cue value of the individual features will overlap but also differ to a greater or lesser degree. The areas of overlap in phoneme inventories and prosodic characteristics across languages provide a starting point for language learning yet at the same time can lead a learner to give insufficient attention to differences (see Chap. 2).

Figure 1.1 gives an overview of the many dimensions in which pronunciation functions in language and communication. As a multi-level and multi-dimensional phenomenon (see Fig. 1.1), pronunciation assumes great importance in communication: it is a major aspect of understanding and interpreting spoken language and speakers’ intentions. Pronunciation is important not only for clarity of message and denotative meaning (the type of meaning conveyed in dictionary definitions of words), but also for subtleties of message meaning and connotation (the type of meaning conveyed by the associations of words in their contexts of use) and in conveying a certain impression of the speaker. Viewed as a communicational resource, pronunciation is a key aspect of communicative competence that goes far beyond being understood in the sense of speaking in such a way that the audience is able to recognize the words being spoken (i.e., intelligibility): it incorporates being understood in the broader sense of speaking in such a way that the audience is able to interpret many things about the speaker’s nature and orientation. Pronunciation is a cue to the speaker’s origin, social background, personal and communal identity, attitudes, and motivations in speaking, as well as the role(s) and position(s) which the speaker is enacting in a specific communicative context.

Fig. 1.1 Dimensions of pronunciation. [Figure not reproduced. Its headings are: focal linguistic units and boundaries; focal information units and boundaries; different types of information; nature; nurture; and situational positioning.]

Pronunciation is an important aspect of spoken language proficiency that includes speakers’ strategic competence:

Strategic competence is the way speakers use communicative resources to achieve their communicative goals, within the constraints of their knowledge and of the situation in which communication takes place. [In all communication], pronunciation has pragmatic effects because of its function in the affective framing of utterances and in defining social and individual identity. Phonological competence has strategic value in terms of a speaker’s ability to relate to and express affiliation with others in a particular social group or geographical area. It has value in terms of academic opportunity and other kinds of opportunities that might be open to a speaker who has a certain type of pronunciation or who has mastery of a range of varieties or styles. It also has value on the job and the job market in terms of being able to communicate competently with specific types of customers, in terms of the image the speaker conveys and the employer wants to promote, and in terms of the geographical range of customers that can be effectively served…. (Pennington, 2015, p. 164)

In these many different ways, pronunciation is a social and expressive resource that can be used in conjunction with other linguistic resources to convey many different kinds of meaning. The wider value of pronunciation and its application across many aspects of language and communication is a central concern of this book.

Phonology as Key to Understanding in Communication

People interpret speech within the whole context of utterance, which includes not only the physical and situational features of the setting in which an utterance occurs, but also the background knowledge and assumptions people bring to the setting of communication. The context includes many types of background knowledge such as the participants’ linguistic competence and cultural background, their knowledge and assumptions about people as individuals and as types, about the communication process in specific situations and, in general, about the world and how it functions. Differences in participants’ observable characteristics, such as their mode of talk, can have either a positive or a negative impact on communication—as can any differences in purposes, preferences, and values that participants construe as relevant to the conduct and interpretation of talk.

Since pronunciation is a main factor in participants’ identification of differences in background, perception of each other, and construal of the speech event, it has a major impact on interactive dynamics and the creation and interpretation of meaning. As a general rule, people process speech by first attending to global features that allow them to form initial impressions. These first impressions help to guide the process of interpretation by cueing the speaker’s

  • Affective state and attitude: compare Thanks a lot spoken with high pitch (suggests pleasure, sincere thanks) vs. low pitch (suggests displeasure, sarcasm);

  • Background knowledge and assumptions: compare the tag in My son Ben’s a good boy, isn’t he spoken with rising tone (suggests asking to know) vs. falling tone (suggests seeking agreement).

In addition, global properties of speech in the way of prosodic information help listeners identify the structure of the utterance and locate linguistic units within that structure: compare no one has spoken on one intonation contour with linking across the three words (no one/has) vs. with two intonation contours and a break after the first word (no/one has).

A person’s pronunciation in all its aspects—including the articulation of specific phonemes, words, and phrases as well as the prosodies of connected speech—is an important aspect of being understood as one intends. Pronunciation is first of all a crucial determinant of whether a person can be understood at all. Each language and language variety (or dialect) of a language has different pronunciation features which must be mastered in order to be understood by those who speak that language or variety. A certain “threshold level” of pronunciation clarity or accuracy (Hinofotis & Bailey, 1980), according to the norms and experience of the audience determining what is understandable to them, is required for communication to take place. This threshold level of skill in pronunciation depends on achieving a basic knowledge of the sound pattern of the language and ability to perceive its phonemic and prosodic elements and distinctions, together with a certain level of skill and automaticity in the mechanics of articulation required to produce those elements and distinctions.

With the goal of maximizing meaningfulness and coherence, speakers generally supply multiple cues to meaning in the way of the particular words, expressions, and grammatical patterns they select and in the way of prosodic and segmental features of their speech. Such multiple cues offer a degree of redundancy that can aid a listener’s processing and understanding of spoken language. However, a language learner’s limited knowledge of the L2 reduces the options for supplying multiple and redundant cues to meaning, and a learner’s limited automaticity of production limits the ability to balance different aspects of utterance production simultaneously.

Segmental Level

Inaccurate pronunciation of individual vowels or consonants can sometimes be compensated for by other message elements and cues in the surrounding context, but it can cause real problems in communication in some situations. For example, pronunciation confusions or lack of differentiation by international medical graduates (IMGs), such as between the words breathing and bleeding (Wilner, 2007, p. 14), are critical to patient health and might in some cases be matters of life and death (Labov & Hanau, 2011). Not all miscommunication is so serious (see the constructed example of Fig. 1.2), but a lack of differentiation between one phoneme and another can easily interfere with understanding and can also lead to impression formation and triggering of stereotypes that may have other kinds of impacts on communication (as discussed further below).

Fig. 1.2 Not free at three. [Figure not reproduced: a constructed dialogue in which a secretary and Mr. Stevens set up an appointment.]

As this constructed example indicates, segmental mispronunciation or misperception may interfere with understanding and communicative purpose to a greater or lesser degree. In addition to potentially causing misunderstanding and miscommunication, segmental errors, substitutions, and nonstandard pronunciation can cause listeners to become distracted from the content of speech and focused on its form, in some cases, resulting in annoyance (e.g., Fayer & Krasinski, 1987) and/or “switching off” and avoiding further contact with a speaker (Singleton, 1995).

Beyond making it possible to understand what someone is saying, the way individual vowels and consonants are pronounced gives listeners useful information in the way of cues—often unintentionally but sometimes deliberately—to the speaker’s background. Thus, a person who says the first vowel in chocolate and coffee in a certain way, as [uɔ], cues possible origin or residence in New York City or nearby areas of New York State and New Jersey. As another example, a person’s pronunciation of the t in water as a glottal stop [ʔ] cues origin or residence in Britain, whereas pronunciation of the t in water as a flap [ɾ] cues origin or residence in the United States—though some young Americans are starting to have glottal stop in water and other words with t in medial (middle) position. People acquire different features of pronunciation depending on where they live and their age because of the specific groups of people they associate and identify with. People may also intentionally adopt features of pronunciation in order to express their social identification or affiliation with speaker groups.

Besides cueing where a person is from, the way the person pronounces individual sounds or words can also be indicative of other characteristics, such as level of education or social status. A well-known example of the connection to social status is one reported by Labov (1966), who researched the pronunciation of /r/ after a vowel (postvocalic /r/) in three New York City department stores: Saks Fifth Avenue (a high-prestige, high-price store), Macy’s (a mid-level store in terms of prestige and price), and Klein’s (a low-prestige, low-price store). He expected the sales clerks in those stores to differ in social status according to the type of store where they worked and also assumed that this difference would be reflected in their pronunciation of postvocalic /r/ as rhotic (with the /r/ articulated) or non-rhotic (with only a vowel articulation), which has been found to vary significantly by region and social class.

For example, postvocalic /r/ has a strongly rhotic pronunciation in much of the United States, though upper and middle class speakers in some coastal areas (e.g., Boston, Charleston, and Savannah), especially older speakers who have long roots in those areas, tend to pronounce words spelled with r after a vowel in a non-rhotic way, lengthening the vowel (and sometimes altering its quality as well). The sentence, Park your car in Harvard yard, with all the ar words pronounced [aː] or [æː], is often given to illustrate this usage and its geographical and social associations. As another contrast, whereas accents in Scotland and Northern England are generally rhotic, accents in Southeast England are generally non-rhotic, and many Australian, Asian, African, and Caribbean varieties of English tend to be non-rhotic. In England, but not in Australia or other parts of the world, the non-rhotic pronunciation of postvocalic /r/ is historically associated with upper and middle class speech. In such cases, the non-rhotic pronunciation has a certain prestige. Where there are social class differences in use of one or another variant pronunciation of a phoneme, it is often found that people tend to employ the variant used by those of higher socioeconomic status in careful speech and that used by lower-middle class or working class speakers in less careful, spontaneous speech or casual speech.

Such differences in the regional and social significance of different pronunciations of postvocalic /r/ formed the backdrop of Labov’s (1966) New York City department store study. Labov asked the store clerks where a certain item could be found that he knew was on the fourth floor, to try to get them to say fourth floor, in order to see if they pronounced the postvocalic /r/ in those words in a rhotic or non-rhotic way. Then he pretended not to have heard them and asked them to repeat what they had just said, as a way to elicit a more careful speech style. He found that the clerks were less likely to pronounce /r/ in the rhotic way the first time, when they were not paying attention to their speech, whereas they were more likely to give a rhotic pronunciation the second time, in careful speech. This was especially true for the final /r/ in floor. In addition, he found that rhotic /r/ was more likely the higher one went up the social scale, so that the Macy’s clerks were more likely to have this pronunciation than the Klein’s clerks, and the Saks clerks more likely than the Macy’s clerks. Thus, Labov confirmed that in New York City, a person’s pattern of behavior involving the pronunciation of /r/ was a linguistic cue or linguistic marker for the person’s social class and also for whether the person was speaking in a casual speech style or a more careful speech style in which attention was focused specifically on clarity, that is, on pronunciation.

For speakers of a second language (L2), pronunciation gives an impression of their language competence, and may also give a generally positive or negative impression of them in other ways. Sometimes, by mastering what is considered a difficult sound in another language (such as, for an English speaker, the German ch or the French or Spanish r), an L2 speaker can make a positive impression on first language (L1) speakers of those languages. This may mean that L1 speakers of those languages might be prepared to spend more attention, time, and effort in communicating with that L2 speaker, thus aiding in the learner’s process of acquiring the language and potentially making good social or professional connections as well.

Paying attention to details of pronunciation and learning to imitate L1 speakers well can pay off. One of the authors of this book (Martha) had this experience in learning Turkish, particularly in relation to words spelled with e (as in the word for “I” ben) and r (as in the word for “one” or “a” bir). She noticed that Turkish ben, although spelled the same as the English name Ben and pronounced that way by the other English speakers in her class, was pronounced by her L1 Turkish tutor, a graduate student from Ankara, with a vowel that was closer to the English word ban, involving a lower tongue position and more open jaw and mouth than for English Ben. She also noticed that the typical English pronunciation of Turkish bir, which was pretty much the same as the English word beer, had the vowel approximately right but not the final consonant, which was quite breathy and sounded like an rr trill (as in Spanish perro “dog” or burro “donkey”), but whispered. Once she noticed how Turkish e and final r differed from English e and r, she tried to imitate the Turkish pronunciations of ben and bir, both very common words, every time she spoke. She soon found her Turkish teacher and tutor, as well as Turkish students in her EFL classes, commenting on how good her Turkish was, even though she was only a beginner! This positive response motivated her to keep at her Turkish study.

L1 speakers often think that the L2 speaker who has mastered certain features of pronunciation is a better speaker of their language than may in fact be the case. Although this positive perception can cause problems when limitations in the speaker’s L2 competence are revealed in communication, it is also an advantage in that L1 speakers are more likely to interact with an L2 speaker who they think is a competent communicator. Thus, paying attention to pronunciation can have a significant communicative payoff that aids language learning.

Prosodic Level

Beyond the basic ability to perceive and produce phonemes and combinations of these to achieve a required threshold of intelligible speech, speakers are able to convey many other aspects of the meaning of a message in whole or in part through pronunciation, as summarized in Fig. 1.1. This includes prosodic features signifying the grouping, continuity, and focusing of information (e.g., which elements cohere or show discontinuities; what is the relative importance of elements) and the communicative function of a linguistic unit in terms of its grammatical status and pragmatic meaning (e.g., whether it is intended as a query, an assertion, or a demand; whether it is to be taken seriously or in jest). Prosody is also an important indicator of a speaker’s attitude towards the audience, and may even determine whether a listener will give the attention and effort needed to receive and interpret the speaker’s message. In these different ways, prosody contributes to a speaker’s ability to convey and a listener’s ability to comprehend meaning and intention. In Hallidayan terms, prosody, and specifically tone and intonation, can express textual (context-related) meaning, ideational (logical sequence) meaning, and interpersonal (social) meaning (Halliday & Greaves, 2008).

If the prosodic features of speech diverge from what a listener expects in a particular context, there can be misunderstanding, sometimes with serious consequences. For example, incorrect stress on numbers can cause misunderstanding between an air traffic controller and a pilot over whether the wind speed at ground level for take-off or landing is gusting to fifteen or fifty miles per hour: with stress on fif-, fifteen may easily be heard as fifty. Wilner (2007, p. 14) gives the example of a doctor’s ability to clearly differentiate in pronunciation between 15 mg and 50 mg as potentially critical to patient health. Somewhat less serious but nonetheless consequential in terms of misunderstanding and potential lost sales is the following example given by Tomalin (2010) of a transaction between a Filipino call center customer service representative (CSR) and a U.K. caller:

Customer: How much is the ticket?

Representative: FOURteen pounds.

Customer: FORTy pounds! That’s too expensive. (p. 175)

Some of the problem in such cases may be the speaker’s failure to articulate a final nasal in the –teen words, since native speakers will often shift stress in those words to initial position if the next word is stressed, as in ˈfourteen ˈpounds (Mompean, 2014), and yet may still be correctly understood to say fourteen and not forty.

Sometimes, with unfamiliar or unexpected prosody, there is no understanding at all. As the authors found when they were both living and working in Hong Kong, getting word tones wrong in speaking a tone language like Cantonese, in which minimal pairs often involve a difference in only a word’s pitch contour, will usually result in complete communication breakdown. An example for English prosody is that of an EFL student studying in the United States who told the story in class of going to the supermarket and asking the cashier, “Where is the [ˌlɛˈtuːs]?” (meaning to say lettuce). After being asked this question multiple times, the cashier became frustrated and refused to give the student any more of her time and attention, turning back to the other customers in line and telling the student he would just have to learn English so people would be able to understand him. When the prosodics are wrong, sometimes a listener is put off or just gives up. This is an example of the larger point that poor, incorrect, or nonstandard pronunciation can cause listeners to become annoyed and distracted from the speaker’s message (Fayer & Krasinski, 1987), even to “switch off” and refuse to interact further with a speaker (Singleton, 1995).

On the other hand, an L2 speaker can often make up for limited knowledge of the language by using prosody well. For example, it is possible for L2 speakers of French to significantly improve the response they will get from Parisians by adjusting their prosody on the universal greetings of (to a woman) Bonjour, madame or (to a man) Bonjour, monsieur. The prosody in question draws attention to the address term (madame or monsieur) through a large pitch contrast between the second syllable of bonjour and the address term, high pitch on the address term, and lengthening of the vowel in the final syllable. The highlighting of the word denoting the person addressed and the high pitch on that word and especially on the final syllable can be interpreted as a show of interest and politeness.

[Figure: Bonjour, madame and Bonjour, monsieur, shown with their phonemes.]

This is a type of prosody that can be considered emphatic or exclamatory (the prosodic equivalent of Bonjour, madame! or Bonjour, monsieur!) and that also carries for Parisians a meaning beyond that of ‘Hello, madame/monsieur’ to include the sense of ‘Happy to see you!’ As a different kind of example, in Hong Kong, L2 speakers of Cantonese often find that when the tone pattern is right, L1 Cantonese speakers can understand even when the individual phonemes are not pronounced accurately. L1 Cantonese speakers also tend to respond more favorably to L2-accented Cantonese when the speaker has relatively good tones.

Miscommunication based on intonation can be serious in terms of the degree of misunderstanding and the inferences people might make from how something is said. Gumperz (1982) reported a clash at Heathrow International Airport in the 1970s between baggage handlers and recently hired Indian and Pakistani women cafeteria-line servers, who the baggage handlers said were treating them rudely. The newly hired cafeteria workers in turn felt that the baggage handlers were discriminating against them. Gumperz recorded and then analyzed interactions between cafeteria workers, both the newly hired Indian and Pakistani women and the older British women working on the cafeteria line, and their customers. He found a prosodic feature that differentiated the two groups of cafeteria workers that he claimed could be related to the bad feelings between the baggage handlers and the new cafeteria workers. He discovered that when customers came to the point in the cafeteria line where they had the option of gravy, the British servers would say the word gravy with a rising tone, in the conventionalized way of offering someone food, through a question signifying “Would you like some gravy?” In contrast, the Indian and Pakistani servers would say gravy with a falling tone, which came across to the baggage handlers as abrupt or surly, signifying not a politely voiced offer but more like an inappropriate command of “This is gravy, take it or leave it.”

A falling tone, which Brazil (1997) labeled a “proclaiming” intonation pattern, is conventionally employed in many varieties of English as a means of asserting something, whereas a rising tone, which Brazil (1997) labeled a “referring” intonation pattern, is a conventional means of suggesting or questioning rather than asserting. A person who uses falling intonation may be perceived not merely as making a proclamation or assertion, but also as assuming a position of controlling the discourse or the audience, whereas a person who uses rising intonation might be perceived as giving over control, or sharing control, of talk with the audience. These different positionings of the speaker by intonation will be perceived as appropriate and effective, or inappropriate and ineffective, depending on circumstances (Pennington, 2018b, 2018c).

As Cameron (2001) points out, when the Heathrow servers said gravy with a falling tone,

…it sounded like an assertion: ‘this is gravy’ or ‘I’m giving you gravy’—which seemed rude and unnecessary, since the customers could see for themselves what it was and decide for themselves if they wanted any…. But in Indian varieties of English, falling intonation has the same meaning as rising intonation in British varieties—in other words, there is a systematic difference in the conventions used by the two groups for indicating the status of an utterance as an offer. Since neither group was aware of that difference, the result is a case of misunderstanding. (p. 109)

Tannen (2014, p. 360) refers to this type of misunderstanding as a failure to understand the metamessage, “how you mean what you say” (p. 358), that is conveyed by intonation in its role of suggesting the context in which the message is to be understood. This is the important role played by intonation as what Gumperz (1982, p. 131) labeled a contextualization cue, a feature or set of features of message form intended by the speaker to guide a listener to a full understanding of message function, as a certain interpretation of the words used and their import in relation to context.

In the context in which Gumperz made his recordings, a server’s rising tone on the word gravy would likely be interpreted by a British English audience, or addressee, as a contextualization cue signifying a metamessage of polite helpfulness and friendliness, indicating that the server was reaching out to the customer in offering gravy, and thus being customer-oriented, whereas a falling tone would not cue this kind of metamessage to British English speakers. Rather, it might be interpreted—especially in an intercultural encounter, where stereotyping can also come into play—as not showing an orientation to the customer, projecting unfriendliness and unhelpfulness, even hostility. Although it is likely that there are other contributing factors to the baggage handlers’ perception of being treated rudely by the Asian cafeteria workers, not using the prosody which is customary and which the audience expects makes it harder to convey not only the intended meaning (the message), but also the politeness and helpfulness (the metamessage) that is conventional and so expected in dealing with customers in this and other similar contexts. By playing the recordings for the airport workers and pointing out the differences in tone and what each can signify, Gumperz helped the Asian and non-Asian employee groups see that they were working with different conventions regarding use of intonation as a contextualization cue and so to achieve some mutual understanding.

In this connection, Cruttenden (2014, p. 335) says that North Germans’ tendency to use downward pitch glides (i.e., falling tones) can sound aggressive to English speakers, such as speakers of General British English (GB), who use rising tones more and falling tones less. An essentially converse example is that in which statements ending in rising pitch (high rising tone, HRT) are interpreted to be questions, though they are not intended as such, or to be cues to the speaker’s lack of conviction or insecurity in communication, though the speaker in fact neither lacks conviction nor is insecure in communicating. The phenomenon of using HRT in statements, so-called “Upspeak” (Bradford, 1997), is a trend among young people in North America (both the United States and Canada), the United Kingdom, Australia and New Zealand, and India [Footnote 1] that is intended to project a metamessage of friendliness and concern for the addressee’s perspective, but is often misinterpreted or criticized by those (especially in the older generation) who do not use rising tone in this way. As a third example, research by Estebas-Vilaplana (2014) showed that mechanically manipulated pitch variation in the recorded Spanish and English versions of wh-question and answer sequences as produced by a bilingual speaker elicited different responses from native Spanish and English speakers. Whereas the Spanish listeners judged the Spanish responses spoken with low pitch as polite, the English listeners judged most of the English responses spoken with low pitch as unexpected and rude, while judging the English responses spoken with a high pitch range as natural and polite. These are telling examples of how significant a person’s pronunciation, and intonation specifically, can be in the meaning conveyed to a specific audience.

Accent and Stereotyping

The general features of speech, including phonemes and prosody, give a certain impression of speakers and their status. Such general features are often labeled accent (see further discussion below). To give an example of how accent can convey different things, to some people, an American Southern accent signifies a charming or cultured person while to others it signifies low social status or lack of education (Campbell-Kibler, 2007). As another example, Americans typically think of people who speak with a standard British accent as charming, cultured, and educated, while Australians may consider those who speak with the same British accent as “stuck up.” On the other hand, not all British accents have these sorts of associations.

Linguistic stereotyping based on accent is a quick way to classify people. The same kinds of evaluations are applied as well to L2, “non-native,” accents. For example, people often say that English spoken with a French accent sounds emotional or romantic while English spoken with a German accent sounds unemotional or formal. Some linguistic stereotypes are quite negative and relate to marginalized social status, as is the case in Hong Kong for Filipino English (Lowe, 2000). These different responses to accented speech often stem from historical facts (e.g., that the British were the ruling class in America at one time or the fact that the majority of Filipinos in Hong Kong are in domestic service), or from characteristics of English as spoken with the features of a particular language (e.g., the prevalence of glottal stop and the lack of linking or coarticulation (also known as sandhi) between words in German-accented English, which to an L1 English speaker may seem like emphatic or formal pronunciation). What is perceived as a foreign accent may be associated with negative and often unconscious stereotypes (Gluszek & Dovidio, 2010) as well as negative emotions towards speakers related to difficulties in understanding what they are saying. Problems in intelligibility may cause processing difficulty that makes listeners judge those they perceive as having a “heavy” foreign accent as less credible (Lev-Ari & Keysar, 2010).

Stereotypes that listeners connect to a person’s way of speaking can have significant and wide-ranging effects (see also Chap. 7). As Tannen (2014) points out, “negative stereotypes can have important social consequences, affecting decisions about educational advancement, job hiring, and even social policies on a national scale” (p. 372). In employment, a person may be discriminated against or considered to be disqualified for a certain job based on the person’s language variety or accent (Pennington, 2018b). Discrimination in employment, both hiring and advancement, is a well-known and widespread phenomenon in the case of African Americans who speak a distinctive, African-influenced variety that has been variously referred to as African American Vernacular English (AAVE), Black English, or Ebonics. John Baugh documents the negative “linguistic profiling,” which he defines as “the auditory equivalent of visual ‘racial profiling’” (Baugh, 2003, p. 155), that has dogged Black Americans based on their language and resulted in discriminatory practices in employment as well as in housing and other areas of life.

Another case of negative and discriminatory linguistic profiling can be cited in Hawaii, where there was a long-standing tradition that became increasingly prominent in the last quarter of the twentieth century of excluding teachers of Filipino background who were well-qualified in terms of their educational credentials and who were fluent English speakers from teaching in local schools based on “accent discrimination” (Chang, 1996, p. 139). School principals and members of the state Department of Education justified the status quo under the rationale that the majority of local students would not be able to understand or relate to the Filipino teachers. As another example of discrimination or profiling based on accent, in the early 2000s the state of Arizona justified assessing English teachers’ accents as a requirement for being a “qualified” English teacher until this was challenged as violating teachers’ civil rights (Ballard & Winke, 2017, p. 122). Even as people are exposed to more and more different varieties and accents of English through media, travel, and the global flows of migrants around the world, it seems that discrimination based on accent is alive and well, and may even be on the rise in both the United States and the United Kingdom, as Moyer (2013, p. 172) maintains.

Baugh (2003) also points out the converse, positive form of discrimination that is part of linguistic profiling, such as the favoring of white applicants for jobs or housing based on their “standard English” accent, or the favorable attitudes that Americans have of some L2 accents (e.g., British-accented English or French-accented English). Yet it must be pointed out that positive discrimination for some based on accent, such as a standard or prestige accent, automatically implies preferential treatment for them at the expense of discriminatory treatment of others.

Pronunciation as a Value-Added Factor in Communication

As this discussion has shown, pronunciation is not only a central and necessary aspect of communication to master, but in the best case is an aspect of spoken language that can result in positive interactions and add value and impact in aspects of life that depend on language and effective interaction with others. It is therefore an important basic as well as value-added factor for much of social, academic, and professional life centering on spoken language communication (as discussed further in Chap. 7). In the negative case, a person’s pronunciation, of both individual phonemes and prosodic features, interferes with understanding what the person is trying to say (the message) and with interpreting what the person means (the metamessage). In the worst case, it can lead to serious miscommunication, misunderstanding, and negative attitudes and also be an aspect of negative and discriminatory linguistic profiling and the various types of social disadvantaging and discrimination that are associated with negative assessments of a person’s language. Attitudes towards a person based on pronunciation are often the result of historical factors and stereotypes and so are long-standing and relatively automatic.

Pronunciation and Identity

People manage impressions by communicating their communal and individual identity in a variety of ways that they consider effective for presenting themselves and conveying their intentions to different audiences and for different purposes. “Language is central to speakers’ alignment with and against various role models and groups, as speakers project an identity by adopting linguistic features of those with whom they most associate and identify socially and psychologically” (Pennington et al., 2011, p. 178). The phonological conventions of different communities offer resources for individuals to project their affiliations as well as aspects of their identity (Zuengler, 1988) through the pronunciation of individual sounds, prosodic features, and accent. A certain type of prosody, such as the rising intonation in statements that is characteristic of “Upspeak” (Bradford, 1997), or pronunciation of a phoneme in a certain way, such as the pronunciation of Spanish z (e.g., zorro “fox”) and c (preceding i and e, e.g., cielo “sky” or cebra “zebra”) as interdental [θ] (e.g., by a Latin American Spanish speaker or North American English speaker), can be employed to intentionally project a certain image, affiliation, or identity to the audience.

A person’s pronunciation is an indicator of the identity and community membership(s) which that person claims and projects to others. Identity is something which is created dialogically, in interaction with others whom one associates and identifies with (Bakhtin, 1984/1929) in speech communities or “communities of practice” (Lave & Wenger, 1991; Wenger, 1998), which Wenger describes as “a group of people who share a concern or a passion for something they do, and learn how to do it better as they interact regularly.” [Footnote 2] In a community of practice, specific knowledge and skills are valued and provide access and proof of membership. As Pennington (2018a) observes, “language learners can maintain a strong identity in one or more communities of practice where their primary language is dominant even as they also aspire to and cultivate status in one or more communities of practice in which a second language is dominant, such as a school, a language class, a Web community, or a multicultural group of friends” (p. 93). Language, and specifically pronunciation, is a central aspect of identity that is tied to many other aspects of identity, such as country and region of origin, ethnicity, culture, education, and profession.

Since the language and specific variety or varieties of language which a person speaks communicate much about the person’s identity and aspirations, and also provide social access and communicative power in specific communities and circumstances, learning a new language can affect the person’s identity and opportunities:

Learning a second or additional language means acquiring a new way of communicating and presenting oneself that can open a person’s identity to change, making identity more malleable and offering opportunities to experiment with new communicative features, such as accent or prosody, and with the social and cultural attributes of the new language and its associated discourses. It also means gaining access to new groups and communities of practice where new knowledge and behaviors can be developed that make it possible to participate in new discourses and to have a role in shaping those communities and discourses, thus enhancing a person’s social and communicative power. Learning a new language can confer social status and can widen opportunities for education, employment, and new experiences that can impact identity. (Pennington, 2018a, p. 94)

These points define important positive aspects of language learning in general and pronunciation learning in particular that teachers need to be aware of and to consider with reference to the students they teach. At the same time, language teachers need to be aware that a learner’s core identity, in being strongly interconnected to the learner’s language and linguistic identity, may not be an easy thing to change and may even represent a felt threat to identity (Pennington, 2018a, p. 95).

As people seek to expand themselves and their experiences and opportunities by learning to speak an additional language, they naturally start from what they already know. Learning to speak a second language begins from a learner’s identity, perceptions, values, and learned behaviors involving the mother tongue or L1, as connected to other aspects of the learner’s identity, perceptions, values, and learned behaviors. The learner’s L1 and the many associated areas of knowledge and identity provide a cognitive, psychological, and social foundation for tackling the tasks of learning a new language as well as a perceptual basis for hearing another language and an articulatory basis for speaking it (see Chap. 2).

Key Concepts of Applied Phonology

Phonemes and Their Contextual (Phonetic) Variants

The sound system of a language consists of its individual phonemes, the distinctive consonant and vowel sounds of the language, and their contextual variants (or allophones), the specific pronunciations of the phonemes in different contexts. All of the speakers of one language or language variety share the same phonemes. Yet there is a tremendous amount of variation in the exact pronunciation of the phonemes of a language, in the way of positional variation of phonemes in sequence as well as regionally and socially conditioned variation. Phonemes not only have different pronunciations in different linguistic contexts, they also have different regional variants (variants associated with different regional accents) and social variants (variants associated with different social groups, such as male and female speakers, upper class and middle class speakers, and different ethnic groups, as well as with different speech styles, such as casual and careful speech). Often variant pronunciations signal sound changes in progress, with some variants representing older features of a language and others representing newer features which have been introduced into the community such as through popular media or new speaker groups (Labov, 2001) and which are spreading.

The different types of variant pronunciations of phonemes are phonetic variants; because they do not differentiate the meaning of words, they are not phonemic. Although the sounds of a language may be described or transcribed at a general (phonemic) level that does not include the detailed phonetic analysis of articulation in different contexts and for different speakers, if the focus is on social or regional characteristics, on the differences between languages or language varieties, or on problems in communication stemming from pronunciation, it will often be necessary to pay attention to phonetic detail.

Listeners may respond differently to specific regional and social variants, as found, for example, in a series of investigations carried out by Campbell-Kibler (e.g., 2007, 2011) using a matched guise technique in which recordings of speech were digitally manipulated to provide pairs of speech samples that differed only in which of two phonetic variants occurred (e.g., an alveolar [n] or velar [ŋ] variant for –ing, or a fronted or backed variant of /s/). Respondents then rated the samples in terms of a series of adjectives for describing the speaker (e.g., relating to regional background or to characteristics such as intelligence or education). According to the results of the 2007 study, American listeners from both the South and the West were more likely to perceive an accent as “Southern” in the [n] guise for –ing and less likely to perceive it as “gay” or “urban.” Both studies showed that listeners rated speakers as less competent in the [n] variant guise for –ing and more competent in the [ŋ] variant. This result perhaps reflects the fact that the [ŋ] variant is more common in careful and middle-class American and British English speech while the [n] variant is more common in casual and working class speech (Labov, 1966, 2001; Trudgill, 1974). In addition, the 2011 study found that /s/-fronting caused listeners to rate male speakers as less masculine, more gay, and less competent.

As illustrated in these studies, variant pronunciations may be associated with a range of listener perceptions and evaluations of speaker characteristics, cueing geographical origin or urban/rural background as well as socioeconomic status, education, intelligence, competence, and personal characteristics. A variant that is associated with status or advantage in terms of education or economic power that would normally be considered prestige in a society may be labeled a prestige variant, such as the [ŋ] variant of –ing as contrasted with the [n] variant. The positive or negative status of regional and social variants is not such a clearcut matter, however, as a prestige variant may be negatively valued in some contexts, such as the [ŋ] variant of –ing if used in casual speech among friends (e.g., giving an impression of an inappropriately formal or careful style). Conversely, a variant that has some negative associations, such as [n] for –ing (Labov, 1966, 2001; Trudgill, 1974) or glottal stop for medial /t/ (Tollfree, 1999; Wells, 1982, pp. 324–325; Williams & Kerswill, 1999), can also have a kind of “covert prestige” (Labov, 1966, 2001) as signaling solidarity or membership in a specific group (e.g., a racial or ethnic group) or speech community.

Phonetics and Sociophonetics

Phonetics traditionally distinguishes the two branches of articulatory phonetics, which studies the physiology of speech and how speakers form, or articulate, individual sounds and combinations of these in longer utterances, and acoustic phonetics, which studies the properties of sound waves in speech and how these are perceived. Sometimes, a separate branch focused on speech perception is distinguished, that of auditory phonetics. The term sociophonetics, which came into widespread use starting in the 1970s in relation to Labovian variationist sociolinguistics (Foulkes, Scobbie, & Watt, 2010), is now used by many linguists to refer to the study of phonetic variation in speech that is socially meaningful, such as the differences in socioeconomic status and speech style conveyed by pronunciation of postvocalic /r/ (Labov, 1966), –ing (Labov, 1966, 2001; Trudgill, 1974), or medial /t/ (Tollfree, 1999; Wells, 1982, pp. 324–325; Williams & Kerswill, 1999) in various American and British speech communities; the social attributes (e.g., “urban,” “gay”) conveyed by the [ŋ] variant of –ing and by a fronted variant of /s/ (Campbell-Kibler, 2007, 2011); the differences in politeness and audience orientation conveyed by falling and rising pitch (Bradford, 1997; Cruttenden, 2014; Gumperz, 1982) and by high or low pitch (Estebas-Vilaplana, 2014); and, in general, the pronunciation features used in the construction or cueing (by speakers) or the perception and interpretation (by listeners) of individual or group identity (Hay & Drager, 2007). Intentional uses of prosodic and segmental phonology in communication as contextualization cues to metamessages are a type of behavior which, since it involves conventionalized social meaning associated with pronunciation, falls within the remit of sociophonetics. [Footnote 3]

The full study of phonology requires some attention to phonetics, that is, to the details of sounds and how they are produced, including positional variation and larger contextual effects on articulation, especially the different kinds of meaning conveyed by socially conditioned (i.e., sociophonetic) variation. In both the traditional sense of phonetics and the expanded sense of sociophonetics, the details of pronunciation and how those details are interpreted in terms of a speaker’s meaning and intentions are especially important for L2 speakers as well as for language teachers and researchers.

Production and Perception

A person’s pronunciation skill or competence has a mechanical aspect in terms of the functioning and control of vocal organs that is required for speech production. From this mechanical perspective, a person’s knowledge of pronunciation involves the manipulation of the physiological organs forming a system of breath, resonance, resistance, and movement that makes speech possible. This is a system connecting the organs that control breathing (the diaphragm, the lungs, and the trachea) and that allow intake and outflow of breath (the mouth and the nose) with the articulators (lips, teeth, tongue, palate, jaw, pharynx, uvula, vocal folds) and their physical and mechanical properties. Speech is the result of a speaker’s actions to manage and shape the air coming up from the lungs in complex ways that produce all of the variations in sound waves which people perceive as specific phonemes (consonants and vowels) and prosodic cues to meaning conveyed by pitch (tone or intonation), length or duration (rhythm), and volume or amplitude (loudness). Any or all of these features of prosody may contribute to meaning by cueing prominence in words (stress) and larger units (accentuated components of a message), the grouping of words into units, and the communicative function of utterances.

Speech is planned as a message that the speaker wants to get across and so is produced not with a goal of articulating individual sounds or words, but with a goal of putting those together to generate meaningful units and coherent stretches of spoken language. Because the generation of speech is usually performed in real time, with limited time to plan and produce an utterance that will convey the message which the speaker intends, there will be trade-offs between some aspects of message production as against other aspects. If a speaker is able to put pronunciation on “auto-pilot,” that is, to set the articulators in a certain way and to speak according to well-established neuromuscular instructions and articulatory routines, the speaker will save cognitive and attentional resources for attending to other aspects of speech production. To the extent that they are able to do so, this is generally the path that speakers follow.

From the perspective of speech perception, a person’s pronunciation competence involves the ability to discriminate auditorially, through (the human faculty of) hearing, the consonant and vowel sounds (phonemes and phonetic variants) of a language and its other vocal signals of meaning in prosody. In order to be able to decode and understand what a speaker is saying, a listener needs to have skills of phonological perception that involve recognizing and decoding segmental phonemes and prosodic patterns and relating these to known words, grammatical constructions, and meanings, including a wide range of pragmatic meanings and communicative effects. Thus, in addition to being able to produce and perceive the segmental and prosodic components of the system, pronunciation competence in both production and perception requires a knowledge of the conventions linking features of pronunciation form to meaning and function, including how they can cue different contextual frames and metamessages (pragmatic and social meaning).

Pronunciation and Spelling

An orthographic system, which is a set of symbols for writing language down, incorporates a set of conventional symbols for spelling sounds and words. For many languages, the correspondences between phonemes and orthographic symbols (graphemes) are not one-to-one but many-to-one (i.e., different phonemes are spelled the same way) and one-to-many (i.e., one phoneme is spelled in different ways). The lack of correspondence between pronunciation and spelling is more extreme in some languages than others. For example, Spanish and Hawaiian have considerably less variation in sound-spelling correspondences than do French or English. The spelling of languages with a long history of writing and/or a long history of influence from other languages (e.g., English) is not a good guide to pronunciation, nor is pronunciation a reliable guide to spelling. An example for English of one-to-many correspondence of a single phoneme to many different spellings is the homonym (homophone) set I, aye, and eye made from the vowel phoneme /aɪ/, which gives a taste of the various spellings of this vowel phoneme in English words, including not only those in these three words, but also, when co-occurring with at least one consonant: y and uy (by/buy); ye and ie (dye/die); ig, igh, and eigh (sign, high, height); and i before a consonant with final silent e (sine, hide). An example for English of many-to-one correspondence in which different phonemes are spelled the same way is the various phonemes that can be spelled with a single letter o, such as in do /u/, done /ʌ/, bosom /ʊ/ (first syllable) and /ə/ (second syllable), co-operate /o/ (first syllable) and /ɑ/ (second syllable), people (unpronounced or silent o).
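The two kinds of mismatch described above can be made concrete with a small lookup table. The following Python sketch is only an illustration: it encodes the example words from this section as (word, spelling, phoneme) triples and looks them up in both directions. The names CORRESPONDENCES, spellings_for, and phonemes_for, the notation i_e for i followed by a consonant and silent e, and the label (silent) for an unpronounced letter are assumptions introduced here for the example.

```python
# Illustrative sketch: sound-spelling correspondences from the examples above.
# Each triple is (word, spelling, phoneme). "i_e" and "(silent)" are ad hoc
# labels assumed for this example, not standard notation.
CORRESPONDENCES = [
    # One phoneme /aɪ/, many spellings (one-to-many)
    ("I", "i", "aɪ"), ("aye", "aye", "aɪ"), ("eye", "eye", "aɪ"),
    ("by", "y", "aɪ"), ("buy", "uy", "aɪ"),
    ("dye", "ye", "aɪ"), ("die", "ie", "aɪ"),
    ("sign", "ig", "aɪ"), ("high", "igh", "aɪ"), ("height", "eigh", "aɪ"),
    ("sine", "i_e", "aɪ"), ("hide", "i_e", "aɪ"),
    # One spelling o, many phonemes (many-to-one)
    ("do", "o", "u"), ("done", "o", "ʌ"),
    ("bosom", "o", "ʊ"), ("bosom", "o", "ə"),
    ("co-operate", "o", "o"), ("co-operate", "o", "ɑ"),
    ("people", "o", "(silent)"),
]


def spellings_for(phoneme):
    """Return every distinct spelling recorded for a phoneme."""
    return sorted({s for _word, s, p in CORRESPONDENCES if p == phoneme})


def phonemes_for(spelling):
    """Return every distinct phoneme recorded for a spelling."""
    return sorted({p for _word, s, p in CORRESPONDENCES if s == spelling})


if __name__ == "__main__":
    print("Spellings of /aɪ/:", spellings_for("aɪ"))
    print("Values of the letter o:", phonemes_for("o"))
```

A one-to-one orthography would return a single item from both lookups. In this small sample, each function returns several items, which is the pattern the section describes for English.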

Voice Quality and Articulatory Setting

As the sound system of a language or variety of language, phonology is tied to linguistic meaning and shared conventions which speakers draw on to communicate in an intentional way. Unintentional or unconventionalized vocal sounds, that is, those which do not signify consistent distinctions and patterns of meaning (e.g., involuntary cries in response to fear, pain, or shock; emotion-induced changes in pitch or other voice characteristics) are not considered to be part of language and so also not part of phonology. These are aspects of communication which, together with facial expressions and gestures, are often classified in the category of paralanguage. Learned and controllable vocal characteristics and segmental features can be employed intentionally as contextualization cues to project metamessages of different kinds through pronunciation, based on conventionalized prosodic and segmental patterns and their associated meanings, including situational affect and attitudes towards the audience (e.g., accommodating or condescending) and to other aspects of the speech event (e.g., sarcasm or irony), as well as aspects of personality and identity (e.g., friendliness or assertiveness, gender or class identity). Uncontrollable, natural physiological vocal responses which reveal a person’s emotional and physical state (e.g., excited, relaxed) are generally considered to belong not to phonology or pronunciation but to the domain of paralanguage. At the same time, natural emotive responses are not entirely separate from conventionalized prosody, as these are very likely the basis for conventionalized prosody such as the various meanings of raised pitch. Thus, the dividing line between phonology per se and what is considered paralanguage is not clearcut, and paralanguage is sometimes defined in a way that includes prosody, especially intonation.

The recognizable, identifying characteristics of an individual voice, which in singing are referred to as timbre, are in speaking often referred to as voice quality, that is, the characteristics of a given voice spanning stretches of speech (Laver, 1980). Van Leeuwen (1999) describes the dimensions of voice quality and timbre as tense/lax, rough/smooth, breathiness, soft/loud, high/low, vibrato/plain, and nasality. At an individual level, voice quality (e.g., a generally nasal and rough, or raspy, voice) and pervasive features of articulation (e.g., lisping) can differentiate and identify a specific speaker from others in the same speech community (though same-gender family members often have recognizably similar voice characteristics). Voice quality has also been associated by van Leeuwen (1999) and others (e.g., Yau, 2010) with different kinds of pragmatic meaning, functioning as a contextualization cue to metamessages. Yau (2010) gives the contrasting examples of customers’ “hot anger,” expressed by loud voice and/or high pitch (p. 117), versus “cold anger,” expressed by soft voice and/or low pitch (p. 118). Pervasive articulatory features can likewise cue different kinds of pragmatic meaning and function as contextualization cues to metamessages, such as intentional lisping to signify childlike speech and hence naivete or silliness.

In addition, voice quality and “the physical postures of the articulators that produce a particular voice quality” (Pennington, 1996, p. 156) and a consistent shaping of articulation throughout speech, termed articulatory setting (Honikman, 1964) or vocal setting, have also been recognized as distinctive for different languages and varieties of language (Collins & Mees, 2013, p. 60). When articulators are set in a certain way, the articulatory setting provides a sort of mechanical or motor template for the production of speech that aids automaticity and fluent speech production. Specific articulatory settings, combining such features as the posture of the tongue (e.g., as fronted or backed, or as having the tongue tip tapered or not), jaw opening (e.g., as relatively open or relatively closed), lip shape (e.g., as relatively spread, rounded, or neutral), and the posture of the vocal folds (e.g., as relatively tense or slack, or as shaping a certain type of glottal opening), are associated with different accents and can identify a speaker’s L1, language variety, or dialect. Collins and Mees (2013, p. 61) describe a range of articulatory settings involving tongue position as characteristic of different British English accents or varieties, and they contrast the articulatory setting of non-regional British English pronunciation (“loose lips, and relaxed tongue and facial muscles”) with that of French (“pouting lip-rounding, and tense tongue and facial muscles”). Although they are sometimes classified as outside phonology proper (i.e., as paralanguage), both conventionalized voice quality, as an indicator of pragmatic meaning, and articulatory or vocal setting, as the underlying articulatory basis of different languages, can be included within the domain of phonology or pronunciation.

Accent and Accentedness

It is a common misperception that speech can be accent-free, stemming from people’s bias towards a familiar style of pronunciation. Munro, Derwing, and Morton (2006) define accentedness as “the degree to which the pronunciation of an utterance sounds different from the expected production pattern” (p. 112). As Pennington (2018b) notes:

People tend to perceive accent or accentedness in relation to a certain kind of pronunciation, which may be that of their own reference group or that of a standard language, as being the “normal”—that is, the common or expected—pronunciation or the “correct” pronunciation. When considered against the listener’s pronunciation baseline or model, other pronunciations will be perceived as more or less divergent or marked, or more or less accented, in comparison to the unmarked pronunciation of the reference group or baseline, which appears to be unaccented or less accented…. Yet even the pronunciation of a standard language is accented: it is a standard accent rather than, say, a rural accent or a minority group accent.

Given the fact that what is the usual or expected pronunciation is entirely relative to the perceiver’s concept of a standard or baseline, it can readily be seen that everyone has an accent.

A person’s accent can be considered as those features of pronunciation that distinguish the person as coming from a specific country, region, or social group, including segmental as well as prosodic features. In the characterization of Moyer (2013):

Intonation, loudness, pitch, rhythm, length, juncture and stress are among accent’s many features; all of which classify speaker intent as they encode semantic and discursive meaning: accent is a medium, through which we project individual style and signal our relationship to interlocutors. Even more broadly, it reflects social identity along various categorical lines. (p. 19)

Accent may not be the sum total of a person’s pronunciation features but rather certain features of pronunciation which are more salient or distinctive as representative of a person’s group origin or affiliation than others, and which may endure even as other features change. These may include regional or social variants, such as the rhotic and nonrhotic pronunciations of /r/ researched by Labov (1966) in New York City, or L2 variants, such as the different pronunciations of /r/ by German and Greek speakers of English researched by Beinhoff (2013). A speaker’s origin may be detectable in features of accent even many years after changing geographical location or social affiliation. On the other hand, people often pick up new accent features from different places where they live.

As Levis (2016) notes, “[h]aving an accent that fits into a given social group may have benefits” (p. 154), such as the following:

  • Cementing social bonds, as a key marker of social identity;

  • Demonstrating social affiliation and so helping to gain access to social networks;

  • Attributing the qualities of a leader to a person;

  • Determining whether listeners will want to interact with a speaker and thereby affecting the availability of opportunities for language acquisition. (summarized from p. 154)

Yet, as Levis maintains,

…accent also has a dark side…. Four common consequences of accent (or even perceived accent) are isolation, an unfair burden on L2 speakers in communication, discrimination, and perceived social stigma. (ibid.)

Although accent can be distinguished from language competence, as a person with a detectable regional or L2 accent may be a highly competent speaker, linguistic stereotyping may nonetheless evaluate what is perceived as a “strong” accent as an indicator of limited competence in language and other things such as intelligence or education.

Accuracy and Fluency

Speech Production

Accuracy of pronunciation is a matter of articulating phonemes as intended so that they can be recognized by an audience as correct according to a certain system of distinctions between sounds. To develop high accuracy of pronunciation requires learning to both perceive and produce phonemes and their variants according to the norms of the community to which pronunciation is referenced. This may mean developing new targets for production that move away from inaccurate ones, such as those based on a different speech community or language, most especially a language learner’s L1 (see discussion in Chap. 2).

Being able to pronounce a language accurately is an automatic result of learning a language as L1 but not necessarily as L2. In the early stages of learning a language, L2 speakers will need to attain a pronunciation threshold, that is, a level of pronunciation accuracy which makes them understandable to others, while at the same time balancing all of the other demands of speaking. Once the threshold level is reached, learners are likely to focus on other aspects of speech production while backgrounding the achievement of full pronunciation accuracy as a goal. In the meantime, most L2 learners will draw on phonological similarities between the L2 and their L1 in applying their L1 categories, articulatory settings, and pronunciation mechanics to production of L2 speech. The pronunciation of most L2 learners will therefore show, to a greater or lesser extent, inaccuracies that derive from applying the phonetic features and motor template of the L1 and from their lesser knowledge and automatization of L2 lexis and grammar, which creates a higher cognitive load in speaking.

Pronunciation accuracy is facilitated by focusing on the auditory features and the articulation of sounds both in isolation and in context, but a focus on pronunciation takes both time and attention away from other aspects of communication that more typically command speakers’ attention. Speakers are therefore prone to automatize pronunciation to the greatest extent possible, following known and practiced routines for articulating phonemes and for realizing larger phonological patterns of coarticulation and prosody. These are generally based in a speaker’s L1 and the specific varieties or dialects that the speaker commands. Engaging practiced articulatory routines and prosodic patterns, according to an automatized motor production template that allows a sort of “pronunciation auto-pilot” to function, can free up cognitive resources; and backgrounding pronunciation makes it possible to foreground meaning and lexical search in planning and producing speech in real time. At the same time, minimizing the conscious attention paid to pronunciation can reach a point of diminishing returns in terms of ensuring sufficient distinctiveness for understanding. For both L1 and L2 speakers, focusing on pronunciation accuracy may require explicit control that interferes with other aspects of message generation and slows down speech.

As Levinson (2000) observes:

[I]t is…possible to identify a significant bottleneck in the speed of human communication—a design flaw, as it were in an otherwise optimal system. The bottleneck is constituted by the remarkably slow transmission rate of human speech (conceived of as the rate at which phonetic representations can be encoded as discriminable acoustic signals), with a limit in the range of seven syllables or 18 segments per second…. (p. 28)

Levinson cites figures suggesting that the cognitive processes preceding articulation in speech production and comprehension take place three to four times faster than the rate at which a person is able to articulate speech. As he goes on to state:

It is this mismatch between articulation rates on the one hand, and the rates of mental preparation for speech production or the speed of speech comprehension on the other hand, which points to a single fundamental bottleneck in the efficiency of human communication, occasioned no doubt by absolute physiological constraints on the articulators. (ibid.)

Whereas drawing on L1 targets and mechanics for articulating the L2 results in some inaccuracy, it promotes fluent production while also making it possible to focus on other aspects of speech production involving the generation of meaningful lexicogrammatical units. Accepting a degree of inaccuracy in pronunciation may seem to be a reasonable trade-off for maintaining a focus on generating meaningful and coherent speech, though the degree of inaccuracy in individual phonemes, intonation, and other aspects of prosody can interfere with meaning and coherence.

Fluency, which is taken as an indicator of coherence and of highly proficient or “nativelike” speech in a second language, is often associated with notions of effortlessness, flow, or “fluidity” of speech (Browne & Fulcher, 2017, p. 38) and with notions of continuity and timing of speech (Dalton & Hardcastle, 1977; Lennon, 1990). As Browne and Fulcher (2017) have observed, “Fluency is as much about perception as it is about performance” (p. 37). Thus, discrete measures of temporal fluency having to do with the speaker’s timing of speech (Lennon, 1990)—including speaking rate, proportion of pausing, and presence of disfluency markers such as pause fillers or hesitators (e.g., um, er), false starts, and repairs—which are aspects of fluency that are quantifiable and measurable by humans or machines, may not equate to perceived fluency as a global or holistic measure.

Lennon (1990) noted that fluency could be defined broadly, as more or less equivalent to overall speaking proficiency, or narrowly, as a component of speaking proficiency that can be assessed separately. In Lennon’s (1990) view, fluency in the narrow sense can be taken to refer to the listener’s impression “that the psycholinguistic processes of speech planning and speech production are functioning easily and smoothly” (p. 391), exhibiting what Segalowitz (2010) labeled cognitive fluency, that is, “the efficiency of operation of the underlying processes responsible for the production of utterances” (p. 165). This narrow sense of fluency is what Brumfit (1984) described as the “psychomotor” aspect of fluency encompassing speed and continuity. In this narrow way of conceptualizing it, fluent speech would be characterized by relatively long and continuous stretches of speech with relatively short stretches of silence (i.e., pauses) and relatively few pause fillers or other types of disfluencies such as false starts. Disfluent speech would be marked by short and discontinuous segments separated by long and/or frequent pauses, pause fillers, nonmeaningful repetitions, stutters, incomplete thoughts, and other indications of nonautomaticity, difficulty, or loss of control while speaking.

This narrow conception accords with a common view of fluency as speaking without hesitation and of disfluency as hesitant or halting speech. According to Fayer and Krasinski (1995), the amount of pausing (measured by total pause time, percentage of pause time, and especially the length of the longest pause in discourse) results in listener judgements of speech as hesitant or not. Perceptions of a speaker’s hesitancy or overall continuity and discontinuity of utterance are a key factor for assessing spoken language performance as skilled or unskilled, native or non-native, normal or abnormal (as in disorders of fluency, see Dalton & Hardcastle, 1977, and Chap. 7). A speaker’s hesitations in speech are also a key factor determining whether a listener remains focused on the speaker’s message or becomes distracted and annoyed (Fayer & Krasinski, 1987). At the same time, in this narrow conception, fluency is defined without reference to meaning or the content of talk. For many purposes, this narrow definition would not suffice, since it does not differentiate fluent and meaningful speech from fluent but meaningless, incomprehensible, incoherent, or minimally informative talk.
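
As an illustration only, the three pause measures named above (total pause time, percentage of pause time, and length of the longest pause) can be computed from a speech sample that has already been divided into speech and silence. The sketch below assumes such a segmentation is available; the Interval structure, the pause_measures function, and the sample timings are hypothetical and are not drawn from any of the studies cited. How a pause is identified in the first place (for example, what minimum duration of silence counts as a pause) is a separate decision that the sketch does not address.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One stretch of a recording (hypothetical structure for illustration)."""
    start: float    # onset in seconds
    end: float      # offset in seconds
    is_pause: bool  # True for silence, False for speech


def pause_measures(intervals):
    """Return total pause time, percentage of pause time, and longest pause."""
    total_time = sum(iv.end - iv.start for iv in intervals)
    pause_lengths = [iv.end - iv.start for iv in intervals if iv.is_pause]
    total_pause = sum(pause_lengths)
    # Guard against an empty sample to avoid dividing by zero.
    percent = 100.0 * total_pause / total_time if total_time > 0 else 0.0
    return {
        "total_pause_time": total_pause,
        "percent_pause_time": percent,
        "longest_pause": max(pause_lengths, default=0.0),
    }


# Invented ten-second sample: three runs of speech separated by two pauses.
sample = [
    Interval(0.0, 2.0, False),
    Interval(2.0, 2.5, True),
    Interval(2.5, 6.0, False),
    Interval(6.0, 7.5, True),
    Interval(7.5, 10.0, False),
]
measures = pause_measures(sample)
print(measures)
```

Measures of this kind capture only the temporal side of fluency; as noted above, they say nothing about whether the speech so measured is meaningful or coherent.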

Fluency is a concept associated with natural-sounding and non-hesitant speech produced in quantity and at a relatively quick rate. Fluency with a focus on pronunciation is a matter of coordinating segmental and suprasegmental aspects so that articulation occurs in a relatively continuous flow of speech, with few gaps or disruptions. Fluent production is focused not on individual sounds but on connected speech and so on units larger than phonemes, syllables, or individual words. At the level of phonology, fluency therefore is more about the prosodic than the segmental level although, contrary to the relatively strong stress features that come into play when a speaker is aiming for high accuracy, fluent production often results in a rapid articulatory rate, with destressing and weakening of boundaries between syllables and words, and thus the extensive coarticulation that is normal in natural and automatized speech production. As Hieke (1985) observed, “Fluent speech is the cumulative result of dozens of different kinds of processes” (p. 140), including linking, levelling (i.e., assimilation), and “outright loss” (ibid.). It should be noted however that fluent speakers can vary speaking rate according to their intentions, such as slowing speech down to emphasize the part of a message expected to be new for the audience and to cue them to pay attention, or speeding through a stretch of speech to cue the part of a message that is expected to be received as “old news.”

To develop fluency at the level of pronunciation—that is, what has been called “phonological fluency” (Pennington, 1989, pp. 26–27, 1990, pp. 546–549) as distinct from fluency in a global sense that incorporates lexical and grammatical choices—a speaker must learn the ways in which sounds are preserved and altered in their connection with other sounds in context. Fluent speech processes are not necessarily the same from one language to another. For example, Delattre’s (1981) analysis of speech in four languages found a tendency in English for vowels to centralize towards the position of schwa [ə] under conditions of fluent production but no comparable tendency in the other languages investigated. Speakers who use articulatory sequencing or connecting strategies from the mother tongue may develop fluency in the second language with an L1 accent. Speakers who focus on the articulation of individual sounds or words in isolation, in contrast, may develop clear and accurate but non-fluent speech in a second language.

Speakers may be highly fluent and yet inaccurate in the sense that their pronunciation diverges in small or large degree from L1 or other speech community norms, making it hard for hearers from a certain speech community to understand speakers who originate in a different speech community. Pronunciation accuracy (or intelligibility, see below) according to “audience-determined norms is…an important goal, especially for those who must convey information to…native speakers, such as teaching assistants in undergraduate courses, supervisors in businesses or people who must speak to clients over the telephone in the target language” (Pennington, 1996, pp. 220–221). It is also important for L2 speakers communicating with each other, since a certain degree of pronunciation accuracy is required for any communication to take place and, beyond that, to avoid miscommunication when the pronunciation of a particular phoneme or word is confused with another. As a different kind of problem, L2 speakers may have accurate articulation of individual sounds or words in the L2, yet lack the ability to put sounds or words together smoothly into longer continuous stretches of speech, thus making it difficult to communicate and also to capture and hold the attention of any audience.

Fluency, especially in the sense of temporal or phonological fluency resulting from automaticity (both cognitive and mechanical) in speech production, is a key goal in language learning as it makes it possible to focus away from articulation and more on meaning and other aspects of communication and the communicative context (e.g., speaker response) and also marks one as a competent speaker. Non-fluent, “over-hesitant speakers are likely to have difficulty communicating with native listeners for any length of time” (Pennington, 1996, p. 220), and other aspects of fluency are interconnected with phonological fluency, as discussed in Pennington (1990). Putting together words and syntax in coherent and meaningful grammatical units, displaying what can be considered lexicogrammatical fluency, presupposes a certain degree of phonological skill as a basis for fluent production. Conversely, the degree of mastery of syntax and lexis is a limiting factor on phonological fluency, that is, the ability to produce continuous speech in relatively long “lexical chunks” and grammatical units. Fluent speech is therefore in a basic sense discourse-level speech, and a language learner’s level of discourse competence, communicative competence, or overall proficiency is closely tied to the level of fluency in speaking an L1, in the case of a young child learner, or an L2, in the case of an older learner. Defined in the broadest terms, fluency in a language is equivalent to communicative competence or proficiency, and this broad competence or proficiency is strongly based in phonology.

Pronunciation accuracy and phonological fluency are important goals in speech that require a careful balance, and maximizing one at the expense of the other can interfere with communication, especially (but not exclusively) in the L2 case. High pronunciation accuracy is often achieved by a speaker through a high degree of control of articulation of individual sounds or words, resulting in relatively deliberate and effortful production that slows speech down and runs the risk of the speaker (and listener) losing the overall coherence of talk. A focus on accurate production of individual sounds or words is comparatively tiring, for both the articulators and the brain of the speaker, and may lose audience attention on message. There is a trade-off, however, in that rapid production of speech following a relatively automatized motor template, while it can produce high phonological fluency, can also lose articulatory distinctiveness, especially the specific quality of vowels and the details of articulation of consonants following vowels. As vowels and consonants become less distinctive and less distinct as words run together, individual phonemes and words may become less intelligible, a common outcome in both L1 and L2 interaction that leads to degraded communication, clarification requests, and attempts at repair that typically require moving away from rapid production of fluent speech.

It is therefore sometimes necessary for a speaker to deliberately abandon a speaking strategy in which pronunciation is on “auto-pilot,” following an automatized motor template or a goal of rapid or fluent production with a focus on meaning or semantics, and to switch to a more controlled pronunciation strategy, paying attention to language form and accuracy, in order to make sure that the audience understands what is being said. Speakers are especially likely to shift the focus in speech production away from meaning/semantics and towards language form and pronunciation accuracy when it seems that a specific addressee has not understood, as Labov (1966) illustrated in his New York City department store study demonstrating the tendency of store clerks to give a rhotic pronunciation of postvocalic /r/ when they thought their first production of fourth floor had not been understood. This is an important fact for instruction that aims to focus learners’ attention on pronunciation and its contribution to meaning (see Chaps. 3 and 4).

Speech Perception

Although accuracy and fluency are usually discussed as aspects of production, they can also be considered in relation to perception. Native speakers of a language who have normal hearing and intelligence can be assumed to develop perceptual accuracy, the ability to recognize and differentiate distinct linguistic items, units, and patterns, and to link them to meaning, as well as perceptual fluency or perceptual automaticity, the ability to recognize and extract the form and meaning of linguistic items, units, and patterns quickly and with a minimum of processing effort. Given that phonology is the surface level of spoken language, perceptual accuracy starts with being able to recognize phonological segments and prosodic patterns, and perceptual fluency specifically incorporates skilled and rapid phonological processing. These kinds of phonological processing operate together with other kinds of processing skills drawing on lexical and grammatical knowledge as well as nonlinguistic knowledge and processing skills (e.g., based on general knowledge and visual information in the context of speech) to generate an utterance interpretation.

For the L2 learner, perceptual accuracy and fluency will take time to develop and so utterance processing may be relatively slow and effortful, and may also be only partial, so that the learner needs to use inferencing skills and context to a high degree in trying to understand speech. In addition, learners’ L2 perceptual processing routines will be based in part in the L1, so that L2 perceptual fluency—to the extent that a learner is able to achieve this—may be bought at a cost to L2 perceptual accuracy. Hence, the demands of rapid decoding and processing of L2 speech for meaning may lead to inaccuracies and errors (e.g., mishearing or misinterpretation of what is heard) or “dead ends” in the way of processing paths which do not result in a meaningful interpretation of a speaker’s utterance.

Nativeness and Pronunciation Competence

As emphasized by a number of those working in applied phonology (e.g., Jenkins, 2000, 2002; Levis, 2005; Pennington, 2015), second language phonology and the teaching of pronunciation have traditionally been focused on nativeness, that is, a native speaker model for performance, as the goal of language learning and teaching. However, given the literal impossibility of being a native speaker of a language when one has in fact grown up speaking a different language as mother tongue, and the difficulty (if not impossibility) beyond early childhood of developing L2 pronunciation that is indistinguishable from the pronunciation of a native speaker (see Chap. 2), the goal for L2 learners is usually stated as one of developing “nativelike” or “near-native” pronunciation. A great deal of the literature on pronunciation is couched in these terms.

Since the 1990s, applied linguists have been questioning the notion of the native speaker in relation to language teaching and L2 speakers in the context of English as an international language (EIL; e.g., Davies, 2003; Leung, Harris, & Rampton, 1997; Pennycook, 1995, 2012; Ricento, 2013; Ur, 2012). In Pennycook’s (2012) view, a more suitable criterion than nativelikeness would be that of a “resourceful speaker” (p. 99), meaning one who is “good at shifting between styles, discourses and genres” (ibid.). Pennington (2015) has described this kind of ability with reference to pronunciation in multilingualism or plurilingualism as competence in multiphonology or pluriphonology, involving speaker agency in using more than one language to express different aspects of identity and metamessages, as in the practices of style-shifting (i.e., changing speech style or language to fit the context; Eckert, 2000; Eckert & Rickford, 2001), crossing (i.e., momentary use of a language from a group other than that to which the speaker belongs; Rampton, 1995), and translanguaging (i.e., use of two languages in combination; García, 2009). Ur (2012) observes that the majority of speakers around the world (outside America) have competence in more than one language and that in the context of international English, the majority of those who speak English employ it as a common language, or lingua franca, to communicate with speakers whose mother tongue is not English. For this very large group of speakers, Ur maintains that a notion of language competence is a more appropriate concept for teaching than native speaker proficiency, and she suggests that such competence should be defined in relation to the communicative requirements of those who use English as an international language.

We suggest that a notion of pronunciation competence can perhaps be developed which considers aspects of communicative competence (Hymes, 1972; Savignon, 1983), both receptive (i.e., perceptual) competence and productive competence, that are specifically referenced to segmental and prosodic phonology. Identifying those aspects of communicative competence that are specifically relevant to pronunciation can help to address the insufficient attention paid to pronunciation in both teaching and testing with an emphasis on communication (see discussion in Chaps. 4 and 6). As Ur (2012) emphasizes, such specifications may reference the competencies needed for English as used in international or lingua franca communication. As we would further emphasize, they might reference competencies needed for pronunciation performance in specific types of employment (see Chap. 7) and those related to identity, social meaning, and communicative pragmatics, including “pronunciation resourcefulness” and multilingual/plurilingual aspects of pronunciation in both perception and production.

Intelligibility, Comprehensibility, and Interpretability

Rather than defining it in terms of an external criterion or model of accuracy, nativeness or nativelikeness, or general pronunciation competence, an appropriate way of conceptualizing L2 pronunciation is in terms of intelligibility, which Munro et al. (2006) define as “the extent to which a speaker’s utterance is actually understood” (p. 112). A speaker may have an accent that diverges considerably from that of a native speaker yet be easily understood (depending on how strong the accent is perceived to be, as discussed above, and also how familiar the accent is, as discussed below). In the view of Isaacs and Trofimovich (2012), “in most situations of L2 use, what really counts is L2 speakers’ ability to be understood, rather than the quality or nativelikeness of their accent…” (p. 477).

As Derwing and Munro (2015) have observed, “In the last twenty years, … both research and practice have placed a sustained emphasis on intelligibility, perhaps because there is now empirical evidence, first, that few adult learners ever achieve native-like pronunciation in the L2…and, second, that intelligibility and accentedness are partially independent….” (pp. 6–7). Jenkins (2000, 2002) has argued that mutual intelligibility among L2 speakers should be the main focus of their pronunciation, and she has proposed a minimum set of phonological features required for mutual intelligibility between L2 English speakers, the Lingua Franca Core (LFC), with this goal in mind for pronunciation teaching (see Chap. 3 for details). Derwing and Munro (2005) agree with Jenkins “that mutual intelligibility is the paramount concern for second language learners” (p. 380), while also pointing out that

…ESL learners have to make themselves understood to a wide range of interlocutors within a context where their L2 is the primary language for communication and where, in many cases, [native speakers] are the majority. In addition, the purposes for communication may vary to a great extent when immigrants integrate socially in the target culture, which is an important difference from [English as an international language] environments. (p. 380)

Following Smith and Nelson (1985), intelligibility is one of three aspects or components of understanding in communication that can be recognized and differentiated: intelligibility, comprehensibility, and interpretability. Intelligibility, defined by Smith and Nelson (1985) in terms of word/utterance recognition (p. 334), “is interactional between speaker and hearer” (p. 333). Considered in information processing terms, intelligibility refers to the extent to which a listener is able to receive a message as it was intended to be sent and to decode its elements. Intelligibility can be considered a basic indicator of proficiency, in that a speaker must send a message in a certain form in order for a specific addressee or audience to be able to receive it clearly and without distortion. As mentioned above, from the point of view of pronunciation, intelligibility is linked to clarity and accuracy, which makes it possible for a listener to discriminate the message elements and recognize them as meaningful linguistic units—the words and the larger grammatical units composing the message. Intelligibility is also linked to fluency, as the processing of speech (by speaker and listener) can break down when continuity is disrupted. Thus, as Fayer and Krasinski (1995) report, hesitations correlate negatively with intelligibility.

In addition to a speaker’s ability to articulate as intended, intelligibility has to do with a listener’s ability to process the speaker’s utterance. Thus, like accuracy, intelligibility is in the eye of the beholder in the sense that it involves judgement of a speaker’s utterance by a listener, usually a specific addressee or larger audience. Browne and Fulcher (2017), following Field (2005), observe that intelligibility has to do with how a listener processes the phonological content of a speaker’s utterance, which relates to the listener’s familiarity with the speaker’s way of speaking, in particular, the speaker’s accent. As Browne and Fulcher (2017) theorize, “increasing accent familiarity reduces the processing effort required for the phonological content of speech” (p. 40) and so makes that speech more intelligible to the listener. Accent familiarity can be considered to enhance perceptual accuracy and perceptual fluency, thereby speeding up the processing of speech. This is especially helpful when the speaker has heavily accented speech and/or is talking relatively fast, since faster processing makes it easier for the listener to keep up with the speaker’s generation of message components in real time. How fast a person talks is an especially important intelligibility factor for a listener who has limited experience with the speaker’s accent or, more generally, with the language the person is speaking, as processing an L2 imposes a considerable cognitive load and takes time. In addition, speakers articulate less distinctly when they are talking fast, and this can make segmentation of the stream of speech into individual words difficult for a listener, especially but not exclusively an L2 listener, thus reducing intelligibility and making comprehension difficult or impossible.

Comprehensibility, or “ease of understanding” (Munro & Derwing, 1995), is defined by Munro et al. (2006) as “the listener’s estimation of difficulty in understanding an utterance” (p. 112). Whereas research has shown that the perception of accentedness is closely associated with segmental accuracy and other pronunciation factors (Saito, Trofimovich, & Isaacs, 2016, 2017), rather than with grammatical or lexical factors, comprehensibility is a multifaceted judgement that takes into consideration both segmental and prosodic features as well as temporal, lexical, and grammatical aspects of L2 speech (ibid.). It can be noted that hesitancy or disfluency is a factor in “ease of understanding” and hence in judgements of comprehensibility, as listeners can find that their comprehension suffers when a person’s speech is substantially interrupted by pausing or other hesitators. On the other hand, “ease of understanding” can also be negatively affected by high fluency when it co-occurs with high accentedness, as listeners may experience moderate or even extreme difficulty understanding highly fluent speech if delivered in an unfamiliar accent.

Interpretability is the listener’s ability to understand the speaker’s intentions in terms of the communicative function or pragmatic force of the message, requiring functional and situational knowledge and knowledge of language-specific contextualization cues that signal metamessages. Interpretability invokes not only speakers’ and hearers’ linguistic knowledge but also their social knowledge more generally. Although full interpretation of metamessages and of the social function and pragmatic force of an utterance depends on the message being intelligible in terms of recognizing its components and comprehensible in terms of knowing the meaning of its words and grammatical structures, the interpretation of social or pragmatic meaning may be separate from, and may precede, analytical, item-by-item decoding and lexicogrammatical comprehension. Interpretation is a process which proceeds in a cyclical way, with global semantic and grammatical processing starting at an early stage and before decoding has been completed (Harley, 2008, p. 270). Utterance interpretation proceeds at the same time both bottom-up, from micro-level details in the speech signal, and top-down, from more global and higher level information—including the listener’s knowledge of situation, social meaning, and connotation—and cycles back and forth between processing levels. As in all of communication, the process of understanding is in part a guessing game—an inferencing or problem-solving process—of piecing together all of the clues or cues available in the utterance, the context, and the listener’s stores of knowledge.

From the speaker’s perspective, the process of building communication starts at the opposite end of the sequence, that is, at the pragmatic and social level at which a message is contemplated and planned in relation to context and audience, then built into semantic and grammatical units that are executed through a sequence of words, themselves sequences of phonemes or phonetic variants, framed by prosody (Levelt, 1989, 1999). Again, this makes speech production seem more orderly than it is, since speakers in most circumstances do not have time to plan fully at a global or macro level, much less at the micro level of fine details of lexical choice or articulation, before beginning an utterance. As a consequence, they start talking with a general intention of the message they want to communicate, but before they are sure what exactly they will say. This natural human tendency to jump into speech and build the elements of the message while already engaged in talk is one source of errors, low intelligibility, and disfluency. Other sources of errors, low intelligibility, and disfluency are lack of time or attention and L2 status. Lack of comprehensibility can also be a sign that the speaker is talking without full knowledge (e.g., of topic, audience, context).

Concluding Remarks

As stated in the Preface, this book differs from other books on pronunciation in bringing together emphases on teaching and research, and in taking a much broader view of pronunciation than other works that incorporate a practical or applied orientation. The content coverage and orientation of this first chapter, in terms of the topics and concepts presented and our perspectives and emphases in discussing them, give an idea of where we are headed and the ways in which this book might differ from other discussions of pronunciation in language teaching, second language acquisition, and applied linguistics. As we have described and illustrated, pronunciation is a central aspect of human language that goes far beyond learning to articulate individual sounds, incorporating multiple layers of linguistic proficiency and types of communicative competence, including the ability of a speaker to produce and a listener to comprehend the meaning of message components relative to context, as well as expressing features of individual and group identity. Pronunciation can thus be considered from a wide variety of perspectives that can be explored, and have been explored, by teachers in their language classrooms and by researchers in educational and other contexts, and that offer a vast vista of further exploratory opportunities in teaching and research.

An important goal of this book is to present up-to-date information on these different aspects of pronunciation in a way that can provide inspiration, direction, and continuing education for teachers and researchers. With this goal in mind, it includes discussion of practical matters of curriculum and teaching, initiatives in the description and understanding of pronunciation offered from different fields of practical application and research, and matters of theoretical interest linking to foreign and second language learning and teaching and to psychology and linguistics. Topics and concepts introduced in Chap. 1 will be developed further in the chapters to come, based on the foundation laid in our definitions, illustrations, and discussion here. In Chap. 2, we build on this foundation to discuss the nature of language learning in both L1 and L2. These two chapters then form the basis for the discussion of pronunciation research and practice in the classroom and in larger contexts of society in Chaps. 3–8.