As explained above, understanding the control the user has over the workings of the BCI involves assessment of the user’s interaction with the device and the voluntariness and precision of their causal contribution to the process of producing synthetic speech. Given the contribution made to the output (or act) by the device, users will not have full control. In this section we follow others (Steinert et al. 2018) in distinguishing between different dimensions of control, as relevant to the processes under consideration.Footnote 2 This facilitates careful analysis of the aspects over which users exert control, and how fine-grained this control is.
Control Over ‘that Something Is Spoken’: Executory Control
The first feature of control we must consider is the user’s control over the event that something is said. Asking whether the user has this control is equivalent to asking whether and with what degree of regularity the production of synthetic speech at a given time is voluntary. We might also think of this as the user’s ability to choose whether and when to use the device.
This acting or not acting – control over when and whether one acts – requires what has been called ‘executory control’. Steinert et al. (2018, sec. 4.3.1) describe executory control as follows:
People have many desires, beliefs, and intentions on which they do not act. Something additional has to come in to realize such intentions: an executory command. Often, it is called a volition. But because the term is controversial and ambiguous, we rather speak of an executory command. From a commonsensical perspective, the idea seems plausible: Sometimes, it occurs to us as if we give a conscious command, a go-signal, to initiate an action, and that it causally executes the action.
We can, at least theoretically, separate two aspects of this notion of executory control. The first is the simple go-signal – that something is initiated; the second is the particular action selected for initiation. In standard cases of action, the two are not practically separable. We cannot, as it were, command ourselves to perform a ‘potluck’ act. As Steinert et al. say, we command an action. In the case of BCIs, however, these aspects of executory control may be more practically separable, so we may need to consider the mere commanding of the BCI, as well as the ability to command specific acts.
This will be highly important for speech BCIs. It will be critical to avoid instances of unintentional speech: if a private thought or mind-wandering moment were to generate brain activity sufficiently similar to that produced when engaging in covert speech, the device could record, decode, and ultimately make this instance overt. If this aspect of executory control were not afforded to the user, the voluntariness of their BCI-mediated speech would be significantly compromised.
Whether or not the user of the speech BCI is afforded this dimension of control – the go-signal – is the first question, the answer to which will depend on the ability of the device to distinguish between voluntary engagement in covert speech and something like the ‘inner thoughts’ or private inner monologue of the user. It is clear that this aspect of control will be crucial.
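To make the go-signal concrete, consider a minimal sketch of how such a gate might work. The classifier, decoder, threshold, and stub implementations below are our own hypothetical constructions, not a real device API; the point is only the structure: no detected intent, no speech.

```python
import random
from dataclasses import dataclass


@dataclass
class NeuralFrame:
    """A short window of recorded neural activity (placeholder for real signals)."""
    samples: list


# Deliberately conservative threshold: better to miss a genuine go-signal
# than to vocalise a private thought.
INTENT_THRESHOLD = 0.95


def detect_speech_intent(frame: NeuralFrame) -> float:
    """Hypothetical classifier: probability that the user is voluntarily
    engaging in covert speech rather than mind-wandering."""
    return random.random()  # stand-in for a trained model


def decode_covert_speech(frame: NeuralFrame) -> str:
    """Hypothetical decoder mapping neural activity to words."""
    return "hello"  # stand-in output


def process_frame(frame: NeuralFrame) -> str | None:
    """Gate all decoding on the go-signal: below threshold, nothing is produced."""
    if detect_speech_intent(frame) < INTENT_THRESHOLD:
        return None  # treated as inner monologue: nothing is vocalised
    return decode_covert_speech(frame)
```

The conservative threshold reflects the asymmetry of errors discussed above: a missed go-signal is recoverable, whereas vocalising private inner speech is not.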
Control Over ‘What Is Spoken and How It Is Spoken’: Executory Goal Selection and Guidance Control
Determining that an instance of synthetic speech occurs is not the only dimension of control the user would want to have. In order to use the device to express themselves, users will also need control over what is spoken and how it is spoken. This shifts the focus to the second aspect of executory control: the action commanded or goal selected. It also introduces a second species of control: guidance control,Footnote 3 which is required in order to govern implementation – how an act or goal is pursued. This latter species of control also shifts the mechanical focus from the initiation of the process to the process itself.
Steinert et al. describe guidance control as follows (2018, sec. 4.3.3):
Sometimes, after initiation, people have control over the ensuing execution of movements, which is the ability to alter and influence the execution of movements, such as the trajectory of a limb movement […] Guidance control is often limited. For instance, we are not aware of the many muscles and their contractions that are necessary to raise an arm, let alone to perform more sophisticated actions such as skiing. Overall, humans lack fine-grained muscular control. Nonetheless, through various feedback channels, we monitor the progression of our movements and are able to adjust them, i.e., we could get them under conscious control, although only on general levels.
We can see the importance of guidance control in the possibility it affords the user to intervene on and shape the action. We will argue that this particular aspect of control is more important for speech BCIs than for BCIs for movement, although we will acknowledge that considerations of efficiency will count against simply maximising this type of control.
It seems that, in the case of speech BCIs, the user will need to be able to exert significant guidance control. That is, it will need to be continuously possible for the user to alter and influence the phrases produced. This is a consequence of there being no clear goal that the user can command in the case of speech. The device will not be able to identify a command to say particular sentences independently of the user forming the words (although it might make predictions). However, the user will not have complete guidance control – over each word or even phrase spoken – due to the algorithmic prediction and correction that occurs as part of the processing.
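A schematic example may help to illustrate this shared guidance control. In the toy pipeline below – our construction, with an invented confidence threshold and a two-entry stand-in for a predictive language model – the user shapes the output word by word, yet low-confidence words are silently replaced by the device’s prediction:

```python
# Toy bigram "language model": likely continuations the device may impose.
COMMON_BIGRAMS = {("good", "morning"), ("thank", "you")}


def correct(previous_word: str, proposal: str, confidence: float) -> str:
    """Replace a low-confidence decoded word with a likely continuation."""
    if confidence >= 0.8:
        return proposal  # the user's decoded word passes through unchanged
    for first, second in COMMON_BIGRAMS:
        if first == previous_word:
            return second  # the device, not the user, selects this word
    return proposal


def decode_stream(decoded: list[tuple[str, float]]) -> list[str]:
    """Continuously produce speech, word by word, as the user forms it."""
    spoken: list[str] = []
    previous = ""
    for proposal, confidence in decoded:
        word = correct(previous, proposal, confidence)
        spoken.append(word)  # vocalised immediately: no drafting stage
        previous = word
    return spoken


# e.g. decode_stream([("good", 0.9), ("mornin", 0.4)]) -> ["good", "morning"]
```

The user guides the stream by continuing to form words, but – as in the section above – never exercises complete control over each word spoken.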
This particular importance of guidance control for speech BCIs, and the apparent inseparability of the goal (what is to be said) from the process (saying what is to be said), contrasts with motor BCIs. In the case of motor BCIs, the most important dimension of control will relate to carefully specified goal selection – whether these are more basic goals (turn left) or more complex goals (pick up the object in front of me). As argued above, goal specification for continuous speech is not feasible – the content of the speech to be said becomes apparent through (covertly, and therefore also synthetically) speaking it.Footnote 4
The difficulty of identifying goals for speech independently of formulating the speech (at the least, engaging in covert speech) constitutes a mechanistic limitation that justifies the emphasis on guidance control for speech BCIs. However, a further reason arises from the generally differing purposes of speech versus movement. As we suggested above, speech is expressive and indirectly instrumental, whereas movement is principally instrumental. Whilst this will not always be the case,Footnote 5 the preponderance of instrumental purpose for movement means that, very often, the process is less important to the user than fulfilling the goal. For example, with certain examples of motor BCI, the user will not care very much how the device moves in order to achieve the goal, as long as it does so in an efficient and unproblematic way. The user of a brain-actuated wheelchair, for example, may not mind which precise path the device takes to reach the other side of the room. Indeed, they may be particularly happy to offload this aspect of control, since the device may be able to operate more smoothly without the user exerting moment-to-moment guidance control. As noted above, the brain-actuated wheelchair discussed by Tamburrini (2009) includes two low-level behaviours – ‘obstacle avoidance’ and ‘smooth turning’ – that are governed by a behaviour-based robotic controller. That the user does not have to provide the commands for these herself makes it easier for her to move around as she wishes (assuming she wants to avoid obstacles and walls).
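This division of labour can be sketched schematically as follows. The function names, goals, and rules here are illustrative assumptions of ours, not Tamburrini’s actual controller; the point is only that the user supplies the high-level goal while the device supplies the trajectory:

```python
def low_level_controller(goal: str, obstacle_ahead: bool) -> str:
    """Translate the user's high-level goal into a safe motor command."""
    if obstacle_ahead:
        return "steer around obstacle"  # handled by the device, not the user
    if goal == "turn left":
        return "smooth left turn"       # trajectory shaping is offloaded
    if goal == "cross the room":
        return "move forward"
    return "stop"                       # default to safety with no clear goal


# The user selects the goal; the precise path taken is not theirs to specify.
print(low_level_controller("cross the room", obstacle_ahead=True))
```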
Although there may be some cost in terms of sense of agency, for motor BCIs it will often be better, as long as the goal can be precisely defined, if parts of the movement are on autopilot. Too much guidance control would compromise the user’s ability to achieve the goals she intends. Indeed, this reduction in guidance control may even increase agents’ global autonomy. In directing the BCI, the user still makes meaningful decisions, but the execution of the goals the user decides to pursue is rendered smoother and requires less effort.
In contrast, in the case of speech BCIs, users do care how the ‘goal’ is reached, and the ‘process’, or expression, is often more important than precise goal fulfilment. In parallel to our comments on semantic accuracy, it may not even be clear what the goal is. However, as outlined, devices for both speech and movement will share control of much of the process, making predictions about the user’s intended speech or actions.
Comparison with Predictive Texting
It is instructive to compare the speech BCIs we are discussing with a potentially parallel, and far more familiar, example of prediction in our communicative efforts. It might be thought that the systems we described are relevantly similar to software for predictive texting (Thurlow and Poff 2013). In this case, too, errors can be made, and prediction can go awry. However, there are a number of differences, which also serve to highlight the potential importance of the final type of control for speech BCIs.
We suggest that the individual using predictive text software has much more control over what is communicated than the user of a speech BCI, notwithstanding the algorithmic intervention in predictive text. First, the texter is far more aware of the ‘workings’ of the process than the speech BCI user: the texter can see the words as they are produced and corrected. Although corrections to words typed may occur automatically, often the texter is able to determine whether to select the next predicted word, which is not inserted automatically. Further, the words appearing in ‘draft’ are not the end of the process and do not constitute the act of communication. Taken together, these features of texting offer an opportunity to reverse predicted words or reject corrections, and the texter is able to gauge when a predicted or corrected word is the word they intended to type. As noted, the drafting of the text is not the act of communication itself, and although corrections and predictions can still be missed, the texter needs to make an additional go-signal – pressing send on the text – in order for the draft to become an act of communication. Unlike the speech BCIs envisaged, texting as communication does not occur in real time; synthetic speech, once the user engages in it, will be continuous, without a ‘drafting’ stage. Finally, as we will explain in further detail below, this real-time ‘hearing oneself’ could make it more difficult for the user to ascertain what they intended to say versus what the device may have predicted or corrected.
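The structural difference can be made vivid with a toy model of drafting and sending, entirely our own invention: nothing is communicated until the second go-signal, and predictions can be inspected and refused beforehand.

```python
class DraftMessage:
    """Toy model of predictive texting: a visible, revisable draft."""

    def __init__(self) -> None:
        self.words: list[str] = []
        self.sent = False

    def type_word(self, word: str, prediction: str | None = None,
                  accept_prediction: bool = False) -> None:
        """The texter sees the prediction and chooses whether to take it."""
        self.words.append(prediction if (prediction and accept_prediction)
                          else word)

    def undo_last(self) -> None:
        """The drafting stage allows corrections before anything is communicated."""
        if self.words:
            self.words.pop()

    def send(self) -> str:
        """Only this second go-signal turns the draft into communication."""
        self.sent = True
        return " ".join(self.words)


msg = DraftMessage()
msg.type_word("see", prediction="sea", accept_prediction=False)
msg.type_word("you", prediction="your", accept_prediction=False)
print(msg.send())  # "see you" – nothing was communicated until this point
```

A continuous speech BCI, by contrast, collapses drafting and sending into a single real-time stream, which is precisely why the next form of control matters.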
Veto Control: Additional No-Go Command
The above comparison shows that, although predictive texting can generate confusion when the texter does not pay attention, the texter does have a window in which to reject the contribution made by the algorithm. The requirement of an additional go-command makes this possible. Whilst this will not be possible for the user of a BCI for continuous speech,Footnote 6 an alternative mode of control could be enabled, which would allow the user to stop the production of speech in its tracks, halting the process.
This kind of control has been called veto control (Clausen et al. 2017, p. 1338; Mele 2009, p. 51ff; Steinert et al. 2018, sec. 4.3). Conceptually, there may be cases of standard action in which this veto control is identical with process control: if I stop my arm just before I touch what I suddenly realise will be a hot stove, I exert control over the process that, until that point, had propelled my arm towards the stove. Whether veto control is distinct from process control is a matter of debate, depending on how one views the mereology of processes (Mele 2009), and in cases where an agent has fine-grained and precise process control over an action, there may be a case for not drawing a distinction. However, in the case of a speech BCI (or, indeed, a BCI for movement) where more of the process is automated, it makes more sense to think of a veto command (and the control it affords) as distinct from ceasing or redirecting the process.
Regardless of the correct conceptualisation of the user’s intervention via a ‘no-go’ command, such a possibility would be valuable, allowing the user to retract synthetic speech as it is spoken. The more automated the process, the more valuable this will be.
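What such a veto might look like in a word-by-word synthesis loop can be sketched as follows, assuming a hypothetical veto detector standing in for a trained classifier:

```python
def veto_detected(neural_frame: dict) -> bool:
    """Hypothetical detector for the user's no-go command."""
    return neural_frame.get("veto", False)


def speak_with_veto(words: list[str], frames: list[dict]) -> list[str]:
    """Vocalise word by word, halting the whole process if a veto arrives."""
    spoken: list[str] = []
    for word, frame in zip(words, frames):
        if veto_detected(frame):
            break  # the ongoing process is halted, not merely redirected
        spoken.append(word)  # this word is vocalised and cannot be unsaid
    return spoken


# e.g. speak_with_veto(["i", "never", "liked"], [{}, {}, {"veto": True}])
# -> ["i", "never"]: the utterance is cut off before "liked".
```

As the example illustrates, a veto cannot unsay words already produced; its value lies in truncating the automated process before it runs on.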
Control: Preliminary Conclusions
To summarise the discussion of control so far, we have argued that technical accuracy facilitates (but does not guarantee) a high degree of control. Additional technical features, such as clear discrimination between neural activity associated with covert speech versus neural activity associated with mere thoughts, will be required to prevent unintentional vocalisation and maintain executory control.
We have also argued that better guidance control is likely to improve semantic accuracy (in so far as this can be assessed) and will facilitate ownership of speech. This is because the greater control the user has over the speech generated, the less it will be shaped by the device. However, we will now argue that even maximising guidance control will not in all cases guarantee ownership of the speech produced. Further, we have acknowledged that greater user control may come at the expense of efficiency, and so efficiency and sufficient ownership of speech may need to be carefully balanced.
In turning now to consider ownership of synthetic speech, we shift the focus from controlling the device or process itself to the user’s relationship to its outcomes.