In the present study, we investigated the temporal prediction of the multisensory consequences of one's own action, using unimodal and bimodal visual and auditory stimuli presented at various delays after a buttonpress. In Experiment 1, in which the buttonpress was self-initiated, we observed an advantage for bimodal trials, as shown by a significant reduction in detection thresholds (PSEs). This advantage was especially evident when the task-irrelevant modality was time-congruent with the action. In Experiment 2, in which the buttonpress was either self-initiated (active) or externally generated (passive), we replicated these results for the active trials. In passive trials, however, overall performance was reduced compared to the active condition. Importantly, we found a significant delay × action interaction, with the largest difference in performance between active and passive trials when the task-irrelevant modality was time-contiguous with the action. Thus, we found a specific enhancement close to the action: the further the task-irrelevant stimulus was from the action, the less enhancement it provided relative to passive trials. These results suggest that the forward model generates predictions for multiple modalities: When a second, task-irrelevant modality was present, performance was enhanced relative to when only the task modality was present. The enhancement was largest when the task-irrelevant modality was not delayed (i.e., congruent with the action and thus in line with the internal temporal prediction), as compared to when the temporal predictions for both modalities were violated.
These results were specific to the active condition, which could mean either that they are due to efference copy mechanisms or, alternatively, that they are due to differences in the temporal predictability of the stimuli. In the active conditions, participants had a better temporal prediction of the upcoming stimuli, since they themselves decided when to press the button. In passive trials, however, the button was pulled down at jittered time points, increasing the uncertainty about the temporal occurrence of the consequent stimuli. This difference in predictability reflects the natural characteristics of actively and passively generated stimuli in daily life, in which passively generated stimuli are generally less predictable than actively generated ones. Nevertheless, as described by Hughes et al. (2013), because of this predictability difference we cannot disentangle whether the effects are due solely to efference copy mechanisms, to differences in temporal predictability, or to both. In the present study, it was not our aim to specify the exact source of the differences between active and passive actions, but rather to investigate differences in the processing of unimodal versus bimodal action consequences generated either actively or passively. Which specific aspect of predictive mechanisms in voluntary action is responsible for our results would be an interesting topic for future studies.
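For readers unfamiliar with the measure, the sketch below illustrates how a PSE can be estimated from delay detection responses, assuming a cumulative Gaussian psychometric function is fit to the proportion of "delayed" reports as a function of the delay between buttonpress and task stimulus. The delays, response proportions, and function names are hypothetical illustrations, not our actual analysis pipeline.

```python
# Illustrative sketch (not the original analysis code): estimating a PSE by
# fitting a cumulative Gaussian to the proportion of "delayed" responses
# as a function of the delay between buttonpress and the task stimulus.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(delay_ms, pse, spread):
    """Cumulative Gaussian: probability of reporting 'delayed' at a given delay."""
    return norm.cdf(delay_ms, loc=pse, scale=spread)

# Hypothetical example data: tested delays (ms) and proportion of "delayed" reports.
delays = np.array([0, 83, 167, 250, 333, 417])
p_delayed = np.array([0.05, 0.15, 0.40, 0.70, 0.90, 0.97])

(pse, spread), _ = curve_fit(psychometric, delays, p_delayed, p0=[200, 80])
print(f"Estimated PSE: {pse:.0f} ms")  # a lower PSE means delays are detected at shorter intervals
```

In this framing, the bimodal advantage reported above corresponds to the fitted curve shifting toward shorter delays, that is, to a lower PSE.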
Our results suggest that our internal model generates temporal predictions for at least two modalities. This is in line with previous studies using unisensory stimuli, which have shown evidence for temporal predictions in the visual, auditory, and somatosensory systems, albeit tested separately. For example, it is widely accepted that presaccadic activity, a potential efference copy signal, predicts the visual consequences of saccades, thereby ensuring visual stability (e.g., Sommer & Wurtz, 2004; von Holst & Mittelstaedt, 1950/1971; but see Bridgeman, 2007). Furthermore, our internal model also plays an important role in predicting the visual consequences of more complex actions, such as various hand movements (Hoover & Harris, 2012; Knoblich & Kircher, 2004; Shimada, Qi, & Hiraki, 2010). Although many studies have focused on the role of the forward model in predicting the visual consequences of actions, several studies have also shown its importance in predicting tactile (Blakemore et al., 1998) and auditory (Curio, Neuloh, Numminen, Jousmäki, & Hari, 2000; Ford et al., 2005) consequences. This suggests that predicting the outcome of our actions in different modalities is based on similar mechanisms. However, the mechanism behind the temporal prediction of multisensory consequences has remained unclear. To our knowledge, our study is the first to specifically investigate multisensory action predictions. Some studies have investigated the role of action expertise in the perception of multisensory actions by showing drumming point-light displays with manipulated audio-visual information to experienced drummers and novices (Petrini et al., 2011; Petrini, Russell, & Pollick, 2009). They found that experienced drummers are better at detecting stimulus asynchronies than novices. Although these studies show that one can acquire internal models of action, which can aid in perceiving ambiguous multisensory information, they did not directly test the prediction of the multisensory consequences of one's own action. One previous study showed that unpredicted visual stimuli affected the loudness perception of auditory stimuli, both for self-generated stimuli and for stimuli predicted by a cue (Desantis, Mamassian, Lisi, & Waszak, 2014). However, that study investigated the general cross-modal effect of the predictability of task-irrelevant stimuli on the perception of the task stimuli. In our study, we were specifically interested in the temporal prediction of multisensory action consequences. A few other studies have included multisensory action consequences; however, these were used to study the sense of agency. For example, Farrer and colleagues found that the presentation of a sound at the time of the buttonpress significantly reduced the thresholds at which participants felt in full control of the appearance of the visual stimulus (Farrer, Valentin, & Hupé, 2013). Similarly, lower thresholds were found when additional tones were presented at the time of the buttonpress and visual stimulus in a cross-modal grouping paradigm with variably delayed visual stimuli (Kawabe, Roseboom, & Nishida, 2013). Although the sense of agency may rely, at least partly, on the same forward model used in sensorimotor predictions (Farrer, Frey, et al., 2008), it has also been suggested to emerge from an additional interpretive mechanism (Kawabe et al., 2013).
In our study, we specifically investigated multisensory temporal predictions of the forward model by using a delay detection task instead of assessing the subjective experience of the action outcome. Furthermore, our results go beyond these previous findings by manipulating the delay of both modalities, which revealed the additional effect that facilitation occurs specifically when one modality is congruent with the action. Nevertheless, a recent study by Kawabe (2015) used a quite similar paradigm. Although Kawabe's study was presented as investigating the sense of agency, the paradigm of the first experiment involved a delay detection task instead of an explicit agency judgment task. In that study, the participant's hand movement was recorded and shown as visual feedback at various delays. Those results are partly in line with ours: in bimodal trials, delay detection was best when the task-irrelevant modality was not delayed. Nevertheless, no bimodal enhancement was found; performance in the bimodal condition with the nondelayed task-irrelevant modality was just as good as in the unimodal condition. Thus, the pattern across bimodal conditions is congruent with ours, but the bimodal enhancement we observed was absent. However, Kawabe's study differs from ours in several respects: First, Kawabe was interested in the general effect of delayed visual feedback on the sense of control over the auditory stimulus, ranging from no visual feedback to extremely delayed visual feedback, and thus included far fewer unimodal than bimodal trials. Furthermore, recordings of the hand were shown as visual feedback. Thus, the visual feedback was of the action itself, whereas the auditory stimulus was presented at or after the buttonpress. In our study, the visual and auditory stimuli were both presented at or after the buttonpress. Importantly, our study included a passive condition in Experiment 2, in which the action was externally generated, whereas Kawabe's study did not. Thus, we partly replicate but also extend the findings of Kawabe's study, by being able to link our results more closely to mechanisms involved specifically in voluntary actions.
Several studies on voluntary actions and their outcomes have observed a so-called "intentional binding effect" (Haggard, Clark, & Kalogeras, 2002). Participants judge the time of their action as closer to the time of the outcome, and the time of the outcome as closer to the action; in other words, the perceived times of action and outcome are attracted toward each other. For externally generated actions, the opposite effect can be seen. It has therefore been argued that the forward model in voluntary actions helps to bind the outcome to our action in order to maintain a sense of agency (Haggard et al., 2002). Although we did not specifically measure intentional binding, our results are in line with this phenomenon to a certain extent. First of all, we found bimodal enhancement, which has previously been reported by Kawabe et al. (2013). Furthermore, for bimodal trials we see a specific advantage when the task-irrelevant modality is not delayed; with increasing delay of this stimulus, delay detection becomes worse. Similar effects have been found with unimodal action outcomes by Wen and colleagues, who found stronger intentional binding effects with increasing delays between action and outcome (Wen, Yamashita, & Asama, 2015). However, whereas intentional binding is usually seen for voluntary actions, and repulsion of the perceived times of action and outcome for externally generated actions, which would predict poorer delay detection in the active condition, we found the opposite pattern: Delay detection performance was generally better in the active than in the passive conditions. In that sense, our results are more in line with various studies that have observed a similar enhancing effect in voluntary actions (Hoover & Harris, 2012; Shimada et al., 2010). The reason for this discrepancy cannot be resolved from our results. However, one should take into account that our paradigm differs in several respects from traditional intentional binding studies. As we mentioned before, we cannot specify whether the efference copy or temporal predictability best explains our results, an issue that is present in intentional binding studies as well (Desantis, Hughes, & Waszak, 2012; Hughes, Desantis, & Waszak, 2013).
It should be taken into account that the order of our stimuli, or the stimulus asynchrony, could have played a role in our results. The bimodal advantage in the active conditions was largest when the stimulus in the task-irrelevant modality was not delayed, that is, congruent with the action. In these cases, the task modality came after the task-irrelevant modality. Recognizing this asynchrony between the two stimuli may have helped the participants, as it indicates that the second stimulus, in this case the task stimulus, must be delayed. Indeed, humans can detect very short stimulus asynchronies (Spence, Shore, & Klein, 2001; Zampini, Shore, & Spence, 2003), suggesting that an order effect could play a role in this study. However, trials with a delayed task-irrelevant modality, in which the task modality came first, and trials with a nondelayed task-irrelevant modality, in which the task modality came after the task-irrelevant modality, both showed significantly lower PSE values than unimodal trials. Importantly, in the passive condition, PSE values were lowest both when the task-irrelevant modality was not delayed and when it was delayed by 417 ms. In fact, the nondelayed condition and the condition with a delay of 417 ms did not differ at all. Thus, even if the order of the stimuli may have helped the participants, it cannot fully explain our results: First, multisensory facilitation was also found in the active task when the task modality was presented first. Second, the advantage for the nondelayed task-irrelevant modality was specific to the active task.
Another point that should be noted is that our stimuli had different durations in the bimodal conditions. We gave the stimuli the same offset, so the stimulus that was presented second had a shorter duration than the first stimulus, which always lasted 1,000 ms. Because we specifically wanted participants to judge the delay between the buttonpress and the onset of the task stimulus, we decided to present the stimuli for this long duration and to give them the same offset, so that the only information useful for the task would be the stimulus onsets, not the offsets or an offset asynchrony. One could argue that different stimulus durations are not optimal. However, if the stimuli had been presented for 1,000 ms each, the total duration of stimulus presentation would have varied, ranging from 1,000 to 1,417 ms, depending on the delay of the second stimulus. In addition, there would have been both an onset and an offset asynchrony, which participants could have used to enhance their performance. Therefore, we decided that it would be a better solution to keep the total stimulus duration the same, with the same offset for both stimuli. We cannot completely rule out that the different stimulus durations had an effect on our data. However, Experiment 2 showed that different stimulus durations cannot fully explain our effects, since we did not see a specific enhancement for the 0-ms delay of the task-irrelevant modality in the passive condition.
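To make the equal-offset logic concrete, the following minimal sketch computes the resulting onsets and durations relative to the buttonpress; the 1,000-ms and 417-ms values come from the design described above, but the function and variable names are purely illustrative, not our experiment code.

```python
# Illustrative sketch of the equal-offset timing described above: the first
# stimulus lasts 1,000 ms, and the second stimulus is truncated so that both
# stimuli end at the same time.
FIRST_DURATION_MS = 1000

def stimulus_timing(task_delay_ms, irrelevant_delay_ms):
    """Return (onset, duration) in ms for each stimulus, relative to the buttonpress."""
    first_onset = min(task_delay_ms, irrelevant_delay_ms)
    second_onset = max(task_delay_ms, irrelevant_delay_ms)
    shared_offset = first_onset + FIRST_DURATION_MS   # both stimuli end here
    return {
        "first": (first_onset, FIRST_DURATION_MS),
        "second": (second_onset, shared_offset - second_onset),
    }

# Example: task stimulus delayed by 417 ms, task-irrelevant stimulus not delayed.
print(stimulus_timing(task_delay_ms=417, irrelevant_delay_ms=0))
# {'first': (0, 1000), 'second': (417, 583)}
```

As the example shows, the second stimulus is shortened by exactly the onset asynchrony, which keeps the total presentation window at 1,000 ms and removes any offset asynchrony.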
Although the stimuli in the active conditions of our experiments should have been perceived as self-generated, one could argue that some stimuli with a long delay would be perceived as externally generated. This would mean that in some bimodal trials, one stimulus would be perceived as self-generated and the other as externally generated, which would make the results hard to interpret. However, it seems unlikely that some stimuli were considered externally generated. A study by Farrer et al. (2013) investigated the feeling of agency and how it is influenced by various factors. Their main paradigm was similar to ours: Participants had to press a button, which would elicit the presentation of a dot on the screen after a variable delay. Instead of having to report whether they detected a delay, participants were asked to report whether they felt they were in control of the appearance of the dot, answering with "full control," "partial control" ("I detected a delay but still feel I was the agent"), or "no control." Their results show that the feeling of "no control" arises only at delays of 628 ms or more, much longer than our longest delay of 417 ms in any manipulation. Furthermore, participants in that study were told that they were not always the agent and that the dot was sometimes controlled by the computer, making them perhaps even more prone to report "no control." In our study, participants were explicitly told that they were always the agent. Taken together, it seems unlikely that some stimuli were considered externally generated in our study.
Our results could also be explained within the more general framework of predictive coding, or the free-energy principle. In this framework, the brain is regarded as a Bayesian inference machine, updating beliefs or expectations using sensory information (Friston, 2010; Picard & Friston, 2014). According to this theory, biological agents resist a natural tendency to disorder, requiring them to minimize the amount of surprise, or free energy, associated with encountered events. They manage to minimize free energy, or prediction error, through both perception and action: Perception can minimize prediction error by updating the predictions, whereas action can do so through active inference, which changes the actual sensory input to fit the predicted sensory input (Friston, 2010). An important difference from the forward-model theory is that in the free-energy formulation, movement is not driven by motor commands but by predictions about the proprioceptive consequences of that movement. Thus, prediction errors are not there to refine motor behavior, but rather themselves constitute the motor signals (Friston, Daunizeau, Kilner, & Kiebel, 2010). In both theories, the efference copy plays an important role. However, whereas the forward model describes the efference copy as a copy of the motor signal that is used to predict the sensory outcome of an action, the active inference account regards the efference copy as a bottom-up prediction error signal used to optimize predictions. In our experimental approach, we specifically manipulated the presence of the efference copy by including active and passive conditions; since an efference copy, in some form, features in both accounts, this manipulation cannot adjudicate between them. Thus, comparing the forward-model theory with the free-energy principle is beyond the scope of this study, but it would certainly be interesting for future studies.
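For reference, a minimal sketch of the standard formulation (after Friston, 2010), not specific to our paradigm: variational free energy F bounds surprise from above, and perception and action reduce different terms of it.

```latex
% Standard decomposition of variational free energy (after Friston, 2010);
% o = sensory observations, s = their hidden causes, q(s) = the agent's approximate posterior.
F \;=\; \underbrace{D_{\mathrm{KL}}\bigl[\,q(s)\,\|\,p(s \mid o)\,\bigr]}_{\text{reduced by perception (updating } q\text{)}}
\;\;\underbrace{-\,\ln p(o)}_{\text{surprise; reduced by action (changing } o\text{)}}
\;\;\geq\;\; -\ln p(o)
```

Because the divergence term is non-negative, minimizing F implicitly minimizes surprise, which is the sense in which perception and action serve the same imperative.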
Furthermore, future studies should extend our findings to more naturalistic contexts, focusing on realistic action outcomes (e.g., videotaped hand movements) rather than only the abstract action outcomes used in our study. However, in a world in which we are surrounded by computers and other devices, pressing a button and expecting a visual and/or auditory consequence is a common action, such as when typing a letter or playing a game. Thus, despite the setup being fairly abstract, it can still be considered ecologically valid.
All in all, we have shown that multisensory consequences of one's own action lead to enhanced detection of delays between action and feedback. Furthermore, we have shown that timing matters: When the task-irrelevant modality is not delayed, that is, time-contiguous with the action, performance is better than when the temporal predictions of both modalities are violated. This effect was specific to trials with voluntary action. Our findings point toward the idea that one forward model creates multisensory temporal predictions. Alternatively, separate, tightly interacting forward models could be used for each modality. Our results cannot distinguish between these two interpretations; future studies using functional magnetic resonance imaging or electroencephalography might be able to disentangle them. Nevertheless, our study is an important first step in unraveling the nature of the temporal prediction of multisensory action consequences. The results support the idea that the forward model provides predictions for multiple modalities and consequently contributes to multisensory interactions in the context of action.