Background

Clinical Needs

A speech brain-computer interface (BCI) is a method of augmentative and alternative communication (AAC) based on measuring and interpreting neural signals generated during attempted or imagined speech [1, 2]. The greatest need for a speech BCI occurs in patients with motor and speech impairments due to acute or degenerative lesions of the pyramidal tracts or lower motor neurons without significant impairment of language or cognition. When movement and speech impairments are particularly severe, as in locked-in syndrome, patients may be unable to independently initiate or sustain communication and may be limited to answering yes/no questions with eye blinks, eye movements, or other minor residual movements. Significant advances have been made to assist these individuals through the use of other types of BCIs, including those using P300 [3], motor imagery [4], handwriting [5], and steady-state visually evoked potentials [6]. However, these forms of communication cannot replace the speed and flexibility of spoken communication: the average number of words communicated per minute in conversational speech is more than 7 times that of eye tracking and 10 times that of handwriting [7, 8]. Finally, speech allows patients to communicate with less effort, as it is a more natural and intuitive modality for information exchange.

Invasive vs Non-invasive BCIs for Speech

Although non-invasive methods of measuring neural activity have been used for BCIs, no existing non-invasive recording method delivers adequate spatial and temporal resolution for use in a speech BCI. Imaging techniques such as functional near-infrared spectroscopy (fNIRS) and functional magnetic resonance imaging (fMRI) provide a delayed and indirect measure of neural activity with low temporal resolution, albeit with relatively good spatial resolution. Although magnetoencephalography (MEG) and electroencephalography (EEG) have adequate temporal resolution, they lack sufficient spatial resolution [9]. Moreover, MEG currently requires a magnetically shielded room, limiting its use to laboratory environments. Although EEG can be recorded at the scalp surface with electrode caps, these caps are cumbersome and require continued attention to electrode impedances to maintain adequate signal quality. Despite their limited resolution, fMRI, MEG, and EEG can provide expansive spatial coverage, which is advantageous when investigating the dynamics of widely distributed language networks.

Because of the limitations of current non-invasive recording techniques, most work on speech BCIs has focused on electrophysiological recordings of cortical neuronal activity with implanted electrodes of varying sizes and configurations [10]. These recordings have focused either on action potentials generated by single neurons or on local field potentials generated by populations of cortical neurons. Most advances in BCI research have arisen from techniques that record action potentials or related multi-unit activity from an ever-increasing number of microelectrodes. Until recently, the gold standard for these recordings used 2D arrays of up to 128 electrodes, each with a single recording tip (Fig. 1). However, recent advances have allowed for up to 32 recording contacts along each implanted electrode, allowing even more single units to be recorded within a small volume of cortical tissue. Robotic operative techniques are also being developed to insert electrodes with less trauma to cortical tissue [11]. These techniques are designed to maximize the number of single units recorded per square millimeter of tissue. However, conventional wisdom holds that the native cortical representations for vocalization and articulation during speech are widely distributed over most of the ventral portion of sensorimotor cortex in the pre- and post-central gyri; thus, any attempt to leverage these representations in a speech BCI will require recordings that can sample from a large surface area. Despite this, recent studies have shown the possibility of decoding speech from microelectrode Utah arrays implanted in dorsal motor areas [12, 13]. Stereo-electroencephalographic (sEEG) depth arrays have also been suggested as a promising recording modality for speech BCI (see detailed review in [14]). sEEG electrodes are thin depth electrodes surgically implanted through small holes in the skull, which makes them minimally invasive. These electrodes can provide broader spatial coverage but are limited in their density.

Fig. 1

High-density 128-channel (8 × 16) ECoG Grid. Photograph taken during subdural implantation. The electrodes are 2 mm in diameter and spaced 5 mm apart. Also visible in the figure are two 8 × 1 electrode strips with electrodes that are 4 mm in diameter and spaced 10 mm apart. Figure reused with permission from Ref. [33]

Electrocorticography (ECoG) uses 2D arrays of platinum-iridium disc electrodes embedded in soft silastic sheets that may be implanted in the subdural space to record electrical activity directly from the cortical surface (Fig. 1). The signals recorded with these electrodes are analogous to local field potentials (LFPs) but are recorded at larger spatial scales that depend on electrode size and spacing. ECoG recordings have been used extensively to identify the source of seizures in patients with drug-resistant epilepsy and to map cortical areas vital for brain function so that they may be preserved during resective surgery [15]. ECoG recordings in this patient population allowed the discovery of high gamma activity (~60–200 Hz) as a useful index of task-related local cortical activation [16], and subsequent studies in animals have shown that this activity is tightly coupled, both temporally and spatially, to changes in population firing rates in the immediate vicinity of recording electrodes [17, 18]. Indeed, differential changes in high gamma activity can be observed at electrodes separated by as little as 1 mm [19]. Thus, the surface area and spatial resolution of cortical representations that can be monitored with ECoG are limited only by the size and density of the electrode array used.
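
Extracting high gamma activity from ECoG recordings is conceptually straightforward. Below is a minimal sketch, assuming a NumPy array of shape (channels, samples) sampled at 1 kHz; the function name, band edges, and normalization are illustrative assumptions rather than parameters from any cited study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def high_gamma_envelope(ecog, fs=1000.0, band=(70.0, 170.0), order=4):
    """Estimate the high gamma analytic-amplitude envelope of ECoG signals.

    ecog : array of shape (n_channels, n_samples)
    fs   : sampling rate in Hz
    band : high gamma band edges in Hz (illustrative values)
    """
    # Zero-phase band-pass filter in the high gamma range
    sos = butter(order, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, ecog, axis=-1)
    # Analytic amplitude via the Hilbert transform
    envelope = np.abs(hilbert(filtered, axis=-1))
    # Z-score each channel so electrodes are comparable
    mu = envelope.mean(axis=-1, keepdims=True)
    sd = envelope.std(axis=-1, keepdims=True)
    return (envelope - mu) / sd

# Example: 128 channels, 10 s of data at 1 kHz
hg = high_gamma_envelope(np.random.randn(128, 10_000))
```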

Target Population for Speech BCI

Because BCIs with adequate temporal and spatial resolution require surgically implanted electrodes, clinical trials of speech BCI devices are currently limited to patients with severe and permanent communication impairments, in whom the risk of surgical implantation can be justified by the severity of disability and a poor prognosis for recovery. The most pressing need for a speech BCI may be found in patients with locked-in syndrome (LIS). Unlike patients who can rely on other means of communication, such as gestures and writing, LIS patients can typically convey their thoughts only through eye movements, eye blinking, or other minor residual movements. For patients with total locked-in syndrome (TLIS), who have also lost the ability to control eye movements, even this minimal means of communication is not possible.

LIS is often caused by damage to the ventral pons, most commonly through an infarct, hemorrhage, or trauma, interrupting the corticospinal tracts bilaterally and producing quadriplegia and anarthria [20, 21]. LIS can also be caused by degenerative neuromuscular diseases such as amyotrophic lateral sclerosis (ALS), in which progressive weakness may result in LIS, especially if patients elect to have a tracheostomy and use artificial ventilation. Three categories of locked-in syndrome have been described: classic LIS, in which patients suffer from quadriplegia and anarthria but retain consciousness and vertical eye movement; incomplete LIS, in which patients have residual voluntary movement other than vertical eye movement; and TLIS, in which patients lose all motor function but remain fully conscious [22].

For LIS patients, anarthria arises from bilateral facio-glosso-pharyngo-laryngeal paralysis [23]. In most LIS patients, the cause of this paralysis does not involve speech-related cortical areas; rather, the anarthria reported in LIS patients usually results from interruption of the neural pathways (the corticobulbar tracts) that provide motor control of speech. Cranial nerve XII (the hypoglossal nerve) controls the extrinsic muscles of the tongue (genioglossus, hyoglossus, and styloglossus) as well as its intrinsic muscles; together these comprise all muscles of the tongue except the palatoglossus [24]. Thus, lesions of the neural pathways to the cranial nerve nuclei, including XII, produce facial, tongue, and pharyngeal diplegia with anarthria, causing severe difficulties in swallowing and speech generation [25].

Another factor hindering speech function in LIS patients is impaired respiratory ability. Speech can be considered a form of sounded exhalation and requires adequate respiratory muscle strength; normal speech requires active exhalation. Lesions of the ventral pons causing LIS not only impede volitional motor behavior but may also affect automatic breathing [26].

Potential target populations for a speech BCI also include patients suffering from aphasia. However, these patients often have pathological changes in speech-related cortical regions, which would hinder the ability of a speech BCI to utilize natural speech circuitry for decoding [27]. While such patients might still be trained to use a less natural neural control strategy, this added challenge makes this population less suitable for initial clinical trials.

Basic Principles of Operation

The underlying physiological premise of a speech BCI is that distinct compositional features of speech can be represented by weighted combinations of neural activity at subsets of recording electrodes [28]. Traditional BCI systems adopt techniques like linear discriminant analysis (LDA) to decode and classify speech into text before synthesizing audio through a conventional text-to-speech (TTS) application [29]. Recent studies have suggested the possibility of decoding neural signals directly, using convolutional neural networks (CNNs) to map high gamma activity recorded at different cortical sites onto speech features such as mel-spectrograms [30,31,32]. The decoded mel-spectrogram can then be used to recreate speech using a pre-trained neural network vocoder. The operation of a typical synthesis-based speech BCI is composed of four stages: recording of the raw neural signal, extraction of neural features from the raw signal, decoding of speech features from the neural features, and synthesis of audio from the speech features (Fig. 2).
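
The last three stages can be thought of as a simple processing chain. The sketch below is a hypothetical, deliberately simplified skeleton (function names, feature shapes, and the toy linear decoder and vocoder stand-in are ours, not from any cited system) showing how recorded neural data might flow from feature extraction through decoding to synthesis.

```python
import numpy as np

def extract_neural_features(raw_ecog, fs=1000.0, frame_ms=50):
    """Stage 2 (toy): mean rectified amplitude in non-overlapping frames per channel."""
    frame = int(fs * frame_ms / 1000)
    n = raw_ecog.shape[1] // frame
    return np.abs(raw_ecog[:, :n * frame]).reshape(raw_ecog.shape[0], n, frame).mean(-1)

def decode_speech_features(neural_frames, weights):
    """Stage 3 (toy): linear mapping from neural frames to 80-bin mel-spectrogram frames."""
    return weights @ neural_frames  # (80, n_frames)

def synthesize_audio(mel_frames):
    """Stage 4 (toy): stand-in for a neural vocoder; returns a placeholder waveform."""
    return np.repeat(mel_frames.mean(0), 200)  # illustrative only, not intelligible audio

# Hypothetical end-to-end pass: 128-channel ECoG, random linear decoder weights
raw = np.random.randn(128, 10_000)
frames = extract_neural_features(raw)
mel = decode_speech_features(frames, np.random.randn(80, 128))
audio = synthesize_audio(mel)
```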

Fig. 2

Basic principles of operation of a speech BCI. During speech, raw neural signals are recorded and processed in real time. A decoder will then map processed neural signals into auditory features or textual transcriptions. Decoded features are then synthesized into audio waveforms and can potentially be played in real time as auditory feedback

Decoding of Speech-Related Neural Signals

The neurophysiological mechanisms responsible for speech production rely on semantic, auditory, and articulatory representations in the cerebral cortex. Activation of these cortical representations can be measured and decoded individually (see detailed review in [33]). The decoding of semantic meaning [34, 35] or gestural representations [36,37,38] alone, however, does not translate into comprehensible speech without additional decoding linking them to linguistic features. Here, we consider only the aspects of speech that can be directly used in communication: from phoneme, to word, to sentence. Along with the grammar of a given language, these sub-units constitute the linguistic aspects of speech and directly support the textual decoding of speech neural signals. We will discuss how the acoustic representations, articulatory trajectories, or textual representations of these linguistic features can be mapped to neural signals (Fig. 3). However, linguistic features are not the only carriers of useful information in conversational speech. Paralinguistic features, such as pitch, tone, intonation, and prosody, convey important information and can significantly modify the meaning of speech. Therefore, we will also discuss speech synthesis, which requires decoding both linguistic and paralinguistic aspects of speech.

Fig. 3

Targets of speech neural signal decoding. The acoustic representation, articulatory trajectories, and textual representations of phonemes, words, or sentences are all potential targets for speech neural decoding

Phoneme Decoding

Although decoding lower-level speech representations can potentially support the decoding of selected words with distinct semantic meaning, higher-level speech representations are preferred if the goal is to restore full conversational speech. One obvious candidate for decoding is the phoneme, the smallest distinguishable segment of speech. Early studies demonstrated the feasibility of phoneme-level neural decoding by classifying a limited set of vowels. Classification of 3 covertly articulated English vowels was achieved with up to 70% accuracy using spike data collected from an intracortically implanted microelectrode in an LIS patient [39]. Another study, using a linear classifier trained on overt syllable-speaking data collected from depth electrodes, reported classification accuracies ranging from 93% for 5 English vowels to near-perfect for 2 vowels [40].

Similar classification on datasets consisting of a limited set of phonemes has also been reported in ECoG studies. Blakely et al. [41] first demonstrated that phoneme pairs can be discriminated using ECoG data collected from a phoneme reading task. Pei et al. [42] classified 4 English vowels with Naïve Bayes classifiers trained on ECoG data collected in word repetition tasks, achieving 40.7% average classification accuracy for overt speech and 37.5% for covert speech. In the same study, they also showed above-chance decoding accuracy for four consonant pairs (leading and trailing consonants in a word). Ikeda et al. [43] applied a linear classifier to 3 covertly articulated Japanese vowels collected in an isolated vowel reading task, achieving decoding accuracies of 42.2% to 46.7%. Apart from direct textual classification, linear classifiers have also been used to decode acoustic formant features of 3 English vowels from ECoG data collected during overt syllable reading [44]. Using spatiotemporal matched filters on ECoG data collected during overt isolated phoneme speaking tasks, Ramsey et al. [45] reported 75.5% decoding accuracy for 4 Dutch phonemes plus rest. Milsap et al. [46] used similar spatiotemporal features in their neural voice activity detection study, successfully detecting all target keywords using neural templates trained on ECoG data from overt syllable reading tasks. Finally, one study investigated the feasibility of classifying all English phonemes using ECoG data collected during overt word reading tasks, achieving 20.4% average decoding accuracy with LDA classifiers [47].
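
As a concrete illustration of the classification approach used in several of these studies, the following sketch trains an LDA classifier on trial-wise high gamma features. The data here are synthetic placeholders, and the channel counts, time bins, and labels are assumptions rather than parameters from any cited experiment.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 200 trials x (128 channels * 10 time bins) of high gamma power,
# each trial labeled with one of 4 hypothetical vowel classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 128 * 10))
y = rng.integers(0, 4, size=200)

lda = LinearDiscriminantAnalysis()
scores = cross_val_score(lda, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f} (chance ~0.25)")
```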

Word Decoding

Neural decoding may also target words, the smallest units of objective or practical meaning [48]. Relatively few studies have attempted to decode isolated words from neural data. For speech production, Kellis et al. [49] trained a linear classifier using ECoG data from overt word repetition tasks, reporting classification accuracies ranging from 89.7% for a vocabulary of 2 words to 48% for a vocabulary of 10. Martin et al. [50] used support-vector machines for pairwise classification of words, achieving 86.2% classification accuracy for overt speech production and 57.7% for covert speech using ECoG data recorded during repetition of isolated words. Apart from acoustic representation and textual classification, decoding of articulatory gestures from word-level speech neural data has also been investigated. Mugler et al. [38] used LDA and achieved 75% and 57.2% decoding accuracy (chance = 29.2% and 39.4%, respectively) for 13 articulator types in two subjects.

Sentence Decoding

As a self-contained and complete vehicle for speech, the sentence is the core unit of language interpretation [51]. Several advantages come with decoding whole sentences: (1) it is a more natural paradigm for communication; (2) the broader spatial distribution and richer temporal structure of sentence-level speech could offer more information for decoding; and (3) the incorporation of language models can increase decoding accuracy. Recent years have seen the growing popularity of sentence-level decoding. Martin et al. [28] successfully reconstructed spectro-temporal features of sentence-level speech from ECoG recordings of both overt and covert sentence reading. Moses et al. [52] used LDA to classify sentence-level speech perception data in real time, proposing both a direct classification approach, in which sentence-level neural activity was used to train the decoder, and a continuous phoneme classification approach similar to their previous method in [53]. Herff et al. [54] developed one of the first functioning systems transcribing neural activity during overt sentence production into textual output. They used Bayesian updating [55] to combine a statistical ECoG phone model with a language model and predict the most likely sequence of words. The word error rate (WER) of this system ranged from 60% for a vocabulary of 100 words to 15% for a vocabulary of 10. More recently, deep learning architectures for automatic speech recognition (ASR) have been used in sentence-level neural decoding, significantly improving decoding accuracy. Makin et al. [56] used encoder-decoder recurrent neural networks (RNNs) [57] to make sequence-to-sequence predictions. In contrast to the other studies mentioned earlier in this section, this study mapped neural activity recorded from ECoG grids onto word sequences instead of phoneme sequences, achieving a 3% WER for a single participant with a vocabulary of about 250 words. In another end-to-end sentence decoding study, Sun et al. [58] proposed a deep learning architecture consisting of a neural feature encoder network trained to extract spatiotemporal neural features, feature regularization networks trained to enforce meaningful representations in latent space, and a text decoder network trained to minimize an alignment-free connectionist temporal classification loss. With a language model, their study achieved WERs of 10.6%, 8.5%, and 7.0% in three different subjects with vocabulary sizes from 1200 to 1900. Recently, Moses et al. [59] achieved online sentence decoding using a chronically implanted 128-channel ECoG grid. Training data were collected during attempted (but unintelligible) overt speech from a participant with quadriplegia and anarthria resulting from a pontine stroke. To decode sentences, they first trained a neural network to detect individual word-production attempts in the neural signals. Subsequent neural networks were trained to classify detected words into one of the 50 words in the limited vocabulary set used in the study. The accuracy of the classification model in offline analysis was 47.1% (chance accuracy was 2%). Two additional models were used in sentence decoding. The first was a language model that predicted the probability that a word would occur in the English language given the sequence of words preceding it; this model was trained on a custom dataset consisting of sentences constructed from the aforementioned 50-word set. The second was a Viterbi decoder that combined the probabilities from the language model and the word classification model to make a final prediction. The study achieved real-time sentence decoding with a median WER of 25.6% (chance WER was 92.1%), and the median number of words decoded per minute was 15.2. Overall, these studies demonstrate the feasibility of transcribing neural data into textual output.
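
The combination of a word classifier with a language model via Viterbi decoding can be sketched as follows. This is a generic Viterbi search over a toy vocabulary with hypothetical bigram probabilities and classifier outputs; it is only an illustration of how the two probability sources are combined, not the implementation used in [59].

```python
import numpy as np

def viterbi_decode(classifier_probs, bigram_logp, init_logp):
    """classifier_probs: (n_word_attempts, vocab) per-attempt word probabilities.
    bigram_logp: (vocab, vocab) log P(word_t | word_{t-1}) from a language model.
    init_logp: (vocab,) log P(first word).
    Returns the most likely word-index sequence."""
    T, V = classifier_probs.shape
    logp = np.log(classifier_probs + 1e-12)
    score = init_logp + logp[0]
    back = np.zeros((T, V), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + bigram_logp        # (previous word, current word)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                  # backtrace the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 5-word vocabulary and a 3-word sentence
vocab = ["I", "am", "thirsty", "hello", "family"]
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(5), size=3)           # toy word-classifier outputs
bigram = np.log(rng.dirichlet(np.ones(5), size=5))  # toy bigram language model
init = np.log(np.full(5, 0.2))
print([vocab[i] for i in viterbi_decode(probs, bigram, init)])
```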

Speech Synthesis

One of the challenges in developing classification-based decoding methods is the variability of speech. Even for a single speaker, speech signals are affected by the rate of speech, coarticulation, emotional state, and vocal effort [60, 61]. To produce textual output, decoding models need to be robust to these sources of variability. At the same time, some of this variability carries linguistic meaning and constitutes an essential part of natural speech. For example, prosody and intonation are often used to convey humorous or satirical intent, as are pauses and varying rates of speech for emphasis. By directly mapping speech neural signals onto acoustic speech or speech-related features, researchers have been able to preserve these paralinguistic aspects of natural speech. Herff et al. [62] proposed a method to improve on previous classification studies: they matched patterns of neural activity to stored training examples and concatenated the corresponding ground-truth speech units to generate continuous audio. Their unit-selection model was trained on small sets of ECoG data (8.3 to 11.7 min) and simultaneous audio recordings during overt speaking tasks. This study demonstrated that intelligible speech could be generated using models that were less demanding of computing resources and that were trained on limited sets of data.
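
In spirit, a unit-selection approach pairs each incoming neural segment with the stored training segment it most resembles and plays back the corresponding audio unit. The sketch below is a heavily simplified, hypothetical version using nearest-neighbor matching over fixed-length neural feature windows; it illustrates the concatenative idea rather than reproducing the method of [62].

```python
import numpy as np

class UnitSelectionSynthesizer:
    """Toy concatenative synthesizer: nearest-neighbor match on neural features,
    playback of the paired ground-truth audio unit."""

    def __init__(self, train_neural_units, train_audio_units):
        # train_neural_units: (n_units, feat_dim); train_audio_units: (n_units, unit_samples)
        self.neural = train_neural_units
        self.audio = train_audio_units

    def synthesize(self, test_neural_units):
        out = []
        for unit in test_neural_units:
            d = np.linalg.norm(self.neural - unit, axis=1)  # distance to every stored unit
            out.append(self.audio[d.argmin()])              # pick the best-matching audio unit
        return np.concatenate(out)                          # concatenate into continuous audio

# Hypothetical data: 500 training units of 128-dim neural features paired with 10 ms audio at 16 kHz
rng = np.random.default_rng(2)
synth = UnitSelectionSynthesizer(rng.standard_normal((500, 128)),
                                 rng.standard_normal((500, 160)))
waveform = synth.synthesize(rng.standard_normal((20, 128)))
```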

The use of deep learning models has significantly improved the performance of synthesis-based speech BCIs. Using data recorded from ECoG grids and sEEG depth arrays during speech perception, Akbari et al. [63] showed intelligible synthesis of sentences and isolated digits using a standard feedforward network mapping ECoG high gamma activity, as well as low-frequency signal features, onto vocoder parameters, including spectral envelope, pitch, voicing, and aperiodicity. They achieved a 65% relative increase in intelligibility over a baseline linear regression model.

Recently, studies based on deep learning methods have also demonstrated the feasibility of synthesis from speech production data. Angrick et al. [30] showed that high-quality audio of overtly spoken words could be reconstructed from ECoG recordings using two consecutive deep neural networks (DNNs). Their first DNN consisted of densely connected neural networks [64] and mapped neural features into spectral acoustic representations. These speech representations were then reconstructed into audio waveforms by WaveNet [65], a second, vocoder DNN. Anumanchipalli et al. [32] reconstructed spoken sentences from ECoG data using two recurrent bidirectional long short-term memory networks (bLSTMs) [66]. Their first bLSTM mapped ECoG high gamma activity onto statistically inferred vocal tract trajectories during full-sentence speech. The second bLSTM then converted these trajectories into acoustic speech features. Finally, an HMM-based excitation model synthesized speech waveforms from these speech features [67]. They also showed that their network generalized to unseen sentences and to silently mouthed speech without vocalization. In both studies, neural activity was first mapped into intermediate speech representations, from which speech waveforms were subsequently reconstructed. Both studies showed reasonable speech reconstruction using relatively small amounts of data: Angrick et al. [30] used datasets between 8.3 and 11.7 min, and Anumanchipalli et al. [32] showed robust decoding performance with as little as 25 min of data. That both studies achieved intelligible speech synthesis with limited data by incorporating intermediate speech representations may point to the particular usefulness of leveraging speech-adjacent features to train models in data-limited settings. A recent study by Kohler et al. [31] also examined the possibility of using an encoder-decoder sequence-to-sequence model to predict spectral acoustic representations from sEEG signals collected during overt speech; audio waveforms were then reconstructed using a WaveGlow [68] vocoder. Together, these findings demonstrate the strong potential of neural networks for decoding and synthesizing speech from neural data. These studies also suggest the benefits of using consecutive neural networks with distinct roles, bridging neural activity, intermediate speech representations, and, eventually, auditory speech reconstruction.
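
A two-stage recurrent architecture of the kind described above can be sketched in PyTorch as follows. Layer sizes, feature dimensions, and the toy training step are our assumptions for illustration only; the cited studies differ in many details.

```python
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    """Stage 1: bidirectional LSTM from high gamma frames to articulatory trajectories.
    Stage 2: bidirectional LSTM from articulatory trajectories to acoustic features."""

    def __init__(self, n_channels=128, n_articulatory=33, n_acoustic=32, hidden=256):
        super().__init__()
        self.neural_to_artic = nn.LSTM(n_channels, hidden, batch_first=True, bidirectional=True)
        self.artic_head = nn.Linear(2 * hidden, n_articulatory)
        self.artic_to_acoustic = nn.LSTM(n_articulatory, hidden, batch_first=True, bidirectional=True)
        self.acoustic_head = nn.Linear(2 * hidden, n_acoustic)

    def forward(self, high_gamma):              # (batch, time, n_channels)
        h, _ = self.neural_to_artic(high_gamma)
        artic = self.artic_head(h)              # intermediate articulatory trajectories
        h2, _ = self.artic_to_acoustic(artic)
        return artic, self.acoustic_head(h2)    # acoustic features for a vocoder

model = TwoStageDecoder()
hg = torch.randn(4, 200, 128)                   # 4 sentences, 200 time frames each
artic, acoustic = model(hg)
loss = nn.functional.mse_loss(acoustic, torch.randn_like(acoustic))  # toy regression target
loss.backward()
```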

Vocoders for Speech Synthesis from Neural Signals

One key component of a synthesis-based speech BCI is the vocoder, which generates a natural-sounding human voice either from textual representations or from acoustic features, depending on the targets of neural decoding. A text-to-speech (TTS) system is often used to synthesize acoustic speech waveforms from word- or sentence-level textual input, making it suitable for providing auditory feedback after a textual transcription has been decoded from neural signals. Early TTS systems relied on unit-selection approaches, concatenating small segments of speech to generate continuous waveforms [69, 70]. Herff et al. [62] took the same unit-selection approach for speech synthesis in their neural decoding study but bypassed the intermediate text representation. Subsequently, statistical parametric speech synthesis (SPSS) grew in popularity. SPSS models map linguistic features derived from text into intermediate acoustic features and reconstruct speech waveforms from these features [71]. Supplied with textual input from neural decoders, these models have been used to generate audio output.

Crucially, SPSS models can be used to vocode not only textual output but also acoustic features decoded from neural signals. Provided the same acoustic representation is used to train the neural decoder and the SPSS model, acoustic waveforms can be reconstructed directly by the vocoder component of the SPSS model. Vocoders used in SPSS can generally be divided into two categories, autoregressive (AR) probabilistic models and phase estimation models, and both have benefited from incorporating deep learning techniques. Among phase estimation vocoders, the classic Griffin-Lim algorithm (GLA) remains one of the most widely used. GLA inverts the spectrogram by exploiting the redundancy of the short-time Fourier transform [72]. A faster algorithm inspired by GLA has also been proposed, improving both the quality and the speed of the original [73]. A GLA vocoder was used in one neural decoding study to reconstruct speech waveforms from quantized spectrograms predicted from sentence-level ECoG data [74].
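
Griffin-Lim inversion is available in standard audio libraries. Below is a minimal sketch using librosa; the magnitude spectrogram here is derived from a synthetic tone so the example is self-contained, whereas in a speech BCI it would come from the neural decoder, and the STFT parameters are illustrative.

```python
import numpy as np
import librosa

sr, n_fft, hop = 16000, 512, 128

# Stand-in for a decoded magnitude spectrogram (in a BCI, produced by the decoder)
y = librosa.tone(220, sr=sr, duration=1.0)
mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

# Griffin-Lim: iterative phase estimation followed by inverse STFT
reconstructed = librosa.griffinlim(mag, n_iter=60, hop_length=hop, win_length=n_fft)
```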

In contrast to phase estimation models, classical AR models have attempted to synthesize speech by estimating parameters of the source-filter model (see detailed review in [75]). An HMM-based excitation model [67] was used by Anumanchipalli et al. [32] to reconstruct acoustic waveforms from the intermediate acoustic features predicted by their neural decoder. In recent years, several deep-learning-based AR vocoders have also shown great promise for use in speech BCIs, including WaveNet [65], SampleRNN [76], WaveRNN [77], and LPCNet [78]. WaveNet was used in one neural decoding study to reconstruct auditory waveforms from decoded acoustic representations [30] and in one speech synthesis study based on electromyography [79]. WaveGlow, a flow-based method inspired by Glow and WaveNet that does not require autoregression, has also been proposed [68]. Recently used by Kohler et al. [31] for offline synthesis in their sEEG-based speech decoding system, it could be a promising candidate as a vocoder in a real-time speech BCI system because of its fast inference time.

Challenges and Future Directions

Chronic ECoG

Most of the speech BCI studies reviewed above have been based on acute or short-term ECoG recordings made for clinical purposes, mostly for the surgical treatment of drug-resistant epilepsy but also for brain tumors [15, 80]. Long-term ECoG signal stability for speech decoding has not yet been fully investigated. However, motor BCI research based on long-term ECoG signals has demonstrated reliable decoding from chronic implants [81, 82]. The safety and stability of ECoG implants in individuals with late-stage ALS have also been reported: over more than 36 months, a motor-based system maintained high performance and was increasingly utilized by the study participant [4, 83]. In one study using the NeuroPace RNS System with sparse electrode coverage, long-term stability of speech-evoked cortical responses was observed [84]. A recently published study examined the feasibility of speech decoding using a chronically implanted 128-channel ECoG grid; the study lasted 81 weeks, with 50 experimental sessions conducted at the participant's home and a nearby office, and the authors reported that the ECoG signals remained stable for decoding purposes across the study period [59]. Beyond the aforementioned studies, the safety of long-term ECoG implantation has been established by multiple studies in non-human primates [85, 86]. Together, these studies indicate that a chronic ECoG implant for a speech BCI should be safe and should provide stable signal quality.

Real-time Speech Decoding and Synthesis

Assistive speech BCI systems for patients with LIS need to operate in real time with reasonably low latency. For systems designed to provide classification-based selection or textual transcription, longer latencies can be tolerated at the expense of the information transfer rate [87]. Studies have shown that real-time ECoG classification is indeed feasible for sentence-level speech perception [52] and overt phrase/word-level speech production [88]. The drawbacks of such systems are the lack of immediate auditory feedback, which plays an important role in the speech production process [89], and the lack of other expressive features of spoken acoustics.

For patients with LIS, a speech BCI system capable of providing real-time auditory feedback could be very useful. Timely sensory feedback, though artificial, can allow users to adjust their vocalization efforts and to detect and correct errors. Although individuals retain the ability to produce intelligible speech for years after losing their hearing, their speech deteriorates over time due to the lack of feedback [90,91,92]. Even though most LIS patients retain intact hearing [21], a similar deterioration of speaking ability might occur due to the absence of self-generated speech and, consequently, of feedback from it. More importantly, since speaking with a synthesis-based BCI system is significantly different from speaking prior to loss of function, recalibration or even relearning of speech production is needed, which in turn requires real-time auditory feedback [93, 94]. Although no ECoG-based online speech synthesis has yet been reported, several studies have explored closed-loop speech synthesis using neurotrophic electrodes [39], stereo-electroencephalography [95], and electromyography [96], with varying degrees of intelligibility.

For synthesis-based speech BCIs aiming to provide auditory feedback, latency must be kept to a minimum to avoid disrupting speech production. Previous evidence suggests that acoustic feedback delayed by 200 ms can disrupt adult speech production [97]. Although slow and prolonged speech can be maintained at delays longer than 200 ms, shorter delays are needed for fast-paced natural speech [98, 99]. Studies of delayed auditory feedback have found that delays of less than 75 ms are hardly perceptible to speakers and that fast-paced speech can be maintained at such delays, while the optimal delay is less than 50 ms [100,101,102].
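
These thresholds translate into a concrete latency budget for a streaming synthesis pipeline. The per-stage timings below are hypothetical values used only to illustrate the bookkeeping against a 50 ms target; they are not measured figures from any cited system.

```python
# Hypothetical per-stage latencies (ms) for a streaming synthesis pipeline
budget_ms = 50  # target upper bound suggested by delayed-auditory-feedback studies
stages = {
    "signal acquisition buffer (one 20 ms frame)": 20,
    "feature extraction (filtering + envelope)": 5,
    "neural-to-acoustic decoding (model inference)": 10,
    "vocoder synthesis + audio output buffering": 12,
}
total = sum(stages.values())
print(f"total latency: {total} ms "
      f"({'within' if total <= budget_ms else 'over'} the {budget_ms} ms target)")
```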

Decoding Silent Speech

Many of the studies reviewed here were based on overt speech production, in which subjects clearly enunciated their speech and produced normal acoustic speech waveforms. This acoustic output can be critically useful for training speech decoders and for providing ground truth when attempting to segment neural signals that correspond to spoken words, phrases, or sentences. However, for patients who are locked in, overt speech production is severely impaired, if not outright impossible. Therefore, speech BCI systems for patients with LIS may need to be trained on and decode silent speech. Speech can be silent either because no attempt is made to phonate or articulate (covert speech) or because articulation occurs without phonation (mimed speech). In patients with varying degrees of paralysis of the muscles for phonation and articulation, speech may be silent even though the patient is attempting to phonate and/or articulate (attempted speech). In overt speech studies, training labels are easily obtained during neural recording sessions in the form of simultaneous audio recordings. For silent speech, experimental paradigms need to be carefully designed so that subjects produce speech with predictable and precise timing. Such experiments are even more challenging with LIS patients, who have difficulty giving feedback, verbally or otherwise.

Compared with overt speech, silent speech not only fails to provide a ground truth for training but may also produce different patterns of cortical activation. Indeed, most studies of covert speech have shown that it is accompanied by far less cortical activation than overt speech. Moreover, the cortical representations of covert speech may differ from those of overt speech, making it more difficult to adapt successful decoding methods from overt studies for use in LIS patients [103, 104]. Despite these challenges, multiple studies have shown success in phoneme [42, 43], word [50], and sentence classification [28] from ECoG signals (see detailed review of covert speech decoding in [105]). Moreover, in patients with paralysis of the speech musculature, cortical activation during attempted speech is comparable to that observed during overt speech in able-bodied subjects [106]. In addition, progress has been made in synthesizing speech from silently articulated (mimed) speech, in which subjects move their articulators without vocalization [32]. A closed-loop online speech synthesis system based on covert speech has also been proposed [95]. However, online speech synthesis with reasonable intelligibility from silent speech had not yet been achieved at the time of this review.

Conclusions

This review has summarized previous studies on speech decoding from ECoG signals in the larger context of BCIs as an alternative and augmentative channel for communication. Different levels of speech representation (phonemes, words, and sentences) may be classified from neural signals. Emerging interest in applying deep learning to neural speech decoding has yielded promising results, and breakthroughs have also been made in directly synthesizing spoken acoustics from ECoG recordings. We have also discussed several challenges that must be overcome in developing a synthesis-based speech BCI for patients with LIS, such as the need for a safe and effective chronically implanted ECoG array with sufficient density and coverage of cortical speech areas, and a real-time system capable of decoding covert or attempted speech in the absence of acoustic output. Despite these challenges, progress continues toward providing an alternative means of speaking for patients with LIS and other severe communication disorders.