From speech to voice: on the content of inner speech


Theorists have found it difficult to reconcile the unity of inner speech as a mental state kind with the diversity of its manifestations. I argue that existing views concerning the content of inner speech fail to accommodate both of these features because they mistakenly assume that its content is to be found in the ‘speech processing hierarchy’, which includes semantic, syntactic, phonemic, phonetic, and articulatory levels. Upon rejecting this assumption, I offer a position on which the content of inner speech is determined by voice processing, of which speech processing is but one component. The resulting view does justice to the idea that inner speech is a motley assortment of episodes that nevertheless form a kind.

Fig. 1
Fig. 2. Reproduced from Belin et al. (2004)


  1.

    Although see Hurlburt and Heavey (2018) for skepticism about reported frequencies of inner speech.

  2.

    The nature of the speech processing hierarchy remains contested. Psycholinguists disagree about how information flows through the hierarchy – serially or in parallel, feedforward or feedback – and about the exact operations and sub-operations within the hierarchy (e.g., Fromkin, 1971; Dell, 1986). Despite these differences, psycholinguists tend to agree on the organization presented in Figure 1.

  3.

    Although see Langland-Hassan (2018) for a contrasting position on phonemes, according to which phonemes are auditory.

  4.

    Langland-Hassan (2018) and Gauker (2018) differ in their framing of concretism and abstractionism. Langland-Hassan seems to assume that inner speech always represents phonetic content, while Gauker seems to assume that inner speech never has a phonetic vehicle. This difference will not matter in my discussion of the views. For this reason, I will use ‘phonetic/auditory/speech sound component’ with the understanding that it translates as ‘phonetic/auditory/speech sound content or vehicle’.

  5.

    Recall that Langland-Hassan (2018) believes that phonemes are auditory (see footnote 3). Although I have denied this (see Sect. 2), for the sake of the present argument, I will use ‘phonological’ in the sense that Langland-Hassan intends.

  6.

    This line of argument puts into relief a plausible alternative explanation of how I know the language of my inner speech: my knowledge that I am speaking English during inner or outer speech is non-observational in just the way that my knowledge that I am grabbing a glass may be non-observational (see Anscombe, 2000). On this account, I know that my inner speech is in English because I use English words in my inner speech, where this knowledge is not grounded in observation. Although Langland-Hassan seems to assume that the knowledge of the language of our inner speech is gained by introspection, the alternative I have mentioned rejects this restriction.

  7.

    Gauker might reject Levelt’s speech control model, since Gauker may take it to depend on the doubtful existence of a language of thought. However, Levelt’s speech control model does not depend on the existence of a language of thought, since the control model is a model of peripheral processes of speech production, e.g., motor control processes, and not core processes, e.g., concept selection.

  8.

    Langland-Hassan (2014) discusses two problems. One problem—call it ‘the kindhood problem’—concerns how distinct inner speech episodes with different combinations of contents from the speech processing hierarchy all count as being cases of inner speech (pp. 519–520). Another problem—call it ‘the binding problem’—concerns how it is that a single inner speech episode possesses a combination of different contents from the speech processing hierarchy (pp. 520–529). Although these problems are related, a solution to the one does not entail a solution to the other. In particular, an account that states what it is in virtue of which inner speech episodes with various combinations of contents count as being cases of inner speech does not thereby state how any one of those episodes possesses the combination of contents it does. This paper is intended to provide a solution to the kindhood problem, not the binding problem.

  9.

    A remaining possibility is that a state counts as inner speech in virtue of being a part of an aborted speech production process. The problem with this proposal, however, is that being a part of a mental process does not individuate mental state kinds. For example, states corresponding to prediction, prediction error, precision, and data are assumed to be parts of the mental process of prediction error minimization (Clark, 2015). But these states do not belong to some further mental state kind in virtue of being a part of the process of prediction error minimization. Or again, the mere fact that motor, sensory, and cognitive states are part of the speech perception process does not entail that they are themselves states of speech perception, nor that there is some further kind to which they belong.

  10.

    One might object that my use of the concept of voice is equivocal. On the one hand, voice refers to information that is extracted from a signal, while on the other hand, voice refers also to the medium within which a signal is communicated. The charge of equivocation stems from the fact that I can come to represent someone’s voice – as a medium – only by extracting information from a signal. This can make it seem as if voice is also represented as information extracted from a signal (e.g., on a par with speech sound information). But this does not follow. Although I learn facts about a medium via the signal it carries, I nevertheless continue to represent the medium as a medium. The same holds for voice: I learn how a voice sounds or whose voice it is by extracting information from a signal, but in so doing I continue to represent the voice as a medium through which the signal is communicated.

  11.

    One might object that this evidence bears on the phenomenon of imagining speech and not on the phenomenon of inner speech. According to this objection, imagined speech involves the representation of another’s voice, while inner speech necessarily involves the representation of one’s own voice. However, it is not clear that the ‘own voice-other voice’ distinction marks a difference in mental state kinds; the objector must show that it does.

  12.

    This paper has been concerned with the nature of the contents of inner speech, and not with the further, important question concerning the mechanism by which inner speech possesses its content (see Carruthers, 2010; Langland-Hassan, 2014; Vicente and Martínez-Manrique, 2016; Knappik, 2017). I believe that it is fruitful to first provide an account of the content of inner speech before fixing on one or another specific mechanism by which it is endowed with such content.

  13.

    Many thanks to Edouard Machery, Peter Langland-Hassan, Wayne Wu, James Shaw, Mark Wilson, and Zina Ward for discussion and feedback on this paper.


