We can watch an object move from one position to another, remember the motion, and reproduce the object’s change in position by means of our hand movements (Posner, 1967). In the cross-modal uses of memory, where input modality at the time of memorization and output modality at the time of reproduction differ, some mechanisms are thought to be at work in the appropriate transfer of information between the input and output modalities (Freides, 1974). To understand the mechanism of this information transfer, previous studies of cross-modal memory have focused on the representation maintained and, in particular, on whether information is maintained as modality-nonspecific or modality-specific (e.g., Connolly & Jones, 1970; Diewert & Stelmach, 1977; Fishbein et al., 1977; Marteniuk & Rodney, 1979; Newell et al., 1979).

Studies on spatial working memory that focus on the rehearsal of spatial information have suggested that modality-unspecific representations are maintained, and have examined whether rehearsal is performed by both the eyes and hands or by only one of them (e.g., Lawrence et al., 2001). These studies rarely specify the modalities involved at encoding and at testing, which is a central issue in cross-modal memory studies, and they do not examine pure cross-modal memory situations. However, for the representations maintained in situations involving the eyes and hands, the dual-task method has yielded consistent results.

In the reviewed studies, the primary task involves sequentially reproducing a visually presented object’s position by hand while keeping the eyes open after tracking the object’s position. In this recall task, subjects only use their eyes for encoding and eyes and hands for testing. To identify maintenance representation related to task performance, subjects perform interference tasks as a secondary task separate from the recall task using their eyes or hands. Results consistently report interference effects for eyes (Baddeley, 1986; Lawrence et al., 2001; Pearson & Sahraie, 2003; Postle et al., 2006) and hands (Della Sala et al., 1999; Lawrence et al., 2001; Pearson & Sahraie, 2003; Quinn & Ralston, 1986; Salway & Logie, 1995; Smyth et al., 1988; Smyth & Pendleton, 1989). These results suggest that information from the eyes is stored as modality-unspecific representation that can be rehearsed by either the eyes or hands (e.g., Lawrence et al., 2001). This theory has been generally accepted in the research on spatial working memory.

However, a modality-specific hypothesis (Connolly & Jones, 1970), in which information input from the eyes and hands is maintained as input-modality-specific representations, has also been proposed. The studies that formed the background of this hypothesis examined relatively pure cross-modal memory situations (Connolly & Jones, 1970; Jones & Connolly, 1970). Connolly and Jones (1970) used a task in which a linearly moving block (moving distances: 4, 6, 8, 10, and 12 inches) was presented to the participants visually or kinesthetically, after which they recalled the distance. In visual recall, the participant said “stop” when the block moved by the experimenter reached the previously presented distance; in kinesthetic recall, the participant moved the block to the previously presented distance themselves. The results showed that K–V, where the input modality was K (kinesthetic) and the output modality was V (visual), was more accurate than V–K. The better performance of K–V, which uses kinesthetic input, is consistent with Posner’s (1967) proposal that kinesthetic input is stored more efficiently than visual input. The asymmetry between V–K and K–V was therefore taken to indicate that information is maintained in input-modality-specific representations. However, Connolly and Jones (1970) did not adopt the dual-task method, and a further drawback is that the asymmetry between V–K and K–V has not been obtained reliably (Diewert & Stelmach, 1977; Marteniuk & Rodney, 1979). This hypothesis therefore requires further examination, as it lacks the empirical support needed for general acceptance.

Which of these two different hypotheses is more valid in cross-modal memory remains unclear. Spatial working memory studies have consistently used the dual-task method and reported stable results supporting the modality-unspecific representation theory. Conversely, studies using pure cross-modal situations have proposed the modality-specific theory but have not produced stable results to support it. If we reassess these studies in terms of the progress of cross-modal memory research, it would be worthwhile to examine which of these two hypotheses is correct by reestablishing a pure cross-modal memory situation and examining the maintained representations using the dual-task method.

Such attempts have been made in recent years. Seemüller et al. (2011) investigated interference effects using a dual-task method in four task situations: intramodal (V–V[ = visual], K–K[ = kinesthetic]), wherein the modality used during encoding and testing is identical, and cross-modal (V–K, K–V), wherein they differ. Visual stimuli were presented as a moving white light spot. Kinesthetic stimuli were presented using a device that moved the hand passively through a given angle of movement. After presenting the target stimuli, one of three types of interference tasks was performed with a 6-s delay period (the immediate condition was performed after 400 ms and the delay condition after 3.2 s). Subsequently, test stimuli were presented, and participants made a recognition judgment on their similarity to the target stimuli. The interference conditions included visual-interference, kinesthetic-interference, and noninterference conditions established through delays within the memory task. Memory performance was compared across the three conditions. The results of the experiments indicated that the eyes caused selective interference in the memory performance of the V–V and V–K tasks, in which the eyes were used during encoding. Similarly, selective interference by the hand was observed in the memory performance on the K–K and K–V tasks, in which the hand was used during encoding. This suggests that information was maintained in a format specific to the input modality during encoding, regardless of whether the task was intramodal or cross-modal. A selective interference effect was identified in the analyses focusing on the input modality but was absent in the analyses focusing on the test modality (V–V and K–V, K–K and V–K). This result suggests that information may not be maintained in a form specific to the output modality of testing, regardless of whether the task is intramodal or cross-modal.
These results indicate that, in the study by Seemüller et al. (2011), information was always maintained as an input-modality representation formed at encoding, and that in cross-modal storage the representation was transformed at testing so that the information could be passed to the different modality used for output.

Notably, Seemüller et al. (2011) mention a mechanism in which modality-specific representations are maintained regardless of cross-modality and intramodality, and information is always maintained according to the input modality. However, some problems in the data analysis method used should be noted. In the next section, we discuss the pros and cons of the proposed hypothesis.

Seemüller et al. (2011) analyzed the intramodal and cross-modal situations together. Specifically, in the analyses focusing on the modality during encoding, V–V and V–K, which use the eyes during encoding, were analyzed together (as were K–K and K–V, which use the hands during encoding). Similarly, in the analyses focusing on the modality during testing, V–V and K–V, which use the eyes during testing, were analyzed together (as were K–K and V–K, which use the hands during testing). Because these analyses always include intramodal tasks, their results cannot be related to purely cross-modal situations. If, as Seemüller et al. concluded, information is always retained in a format specific to the encoding modality and the representation is transformed at testing, only the cross-modal setting should be considered; in other words, the intramodal setting must be excluded from the analysis. Specifically, the cross-modal V–K and K–V tasks should be analyzed individually to determine whether each consistently produces a selective interference effect based on the encoding modality. However, no such analysis has been conducted. The combined intramodal and cross-modal analyses therefore leave open what kinds of representations are used and to what extent they can explain how information is passed between modalities.

Can we resolve this question using studies that analyze intramodal and cross-modal memory separately? Fujiki and Hishitani (2010), who focused on intramodal memory, used a recognition task in which only the eyes were used for encoding and testing, and explored whether the eyes and hands had interference effects on the maintenance of information. Stimuli were presented visually as black dots at two positions in sequence, and participants were asked to remember the direction in which the dots moved. After a delay of 5 s, they judged whether the direction of a test stimulus was the same as the one previously presented. Of the two positions, the first was always fixed; participants answered “yes” if the second position was the same as before and “no” if it was different. In the eye-interference condition, the participants traced an imagined square with their eyes while keeping them closed during the delay; in the hand-interference condition, the participants tapped on a desk with the index finger of their right hand so as to trace a square, with their eyes closed. In the no-interference condition, the participants were presented with a randomly selected digit from 1 to 9 and were asked to give the following number (e.g., to say “2” after hearing “1”) as quickly as possible. The results indicated that memory performance in the eye-interference condition was significantly lower than in the hand-interference and no-interference conditions, indicating a selective interference effect of the eyes. Furthermore, when they examined the effects of rehearsal tasks, in which the participants spontaneously and repeatedly moved their eyes and hands to confirm the memorized information, they identified a selective facilitation effect of the eyes. These results indicate that in eye–eye intramodal memory, modality-specific representations, rather than modality-nonspecific representations, are responsible for the maintenance of information.

Later, Fujiki and Hishitani (2011) examined the case in which only the hands were used for task performance and reported selective interference and facilitation effects of the hands. This result indicates that modality-specific representations can also be maintained by the hand, in a manner distinct from the eye-based maintenance observed by Fujiki and Hishitani (2010).

The intramodal studies of Fujiki and Hishitani (2010, 2011) suggest that modality-specific representations can be consistently retained. At first glance, this result appears to be consistent with that of Seemüller et al. (2011), who analyzed intramodal and cross-modal data together. However, when the results of spatial working memory studies in which both eyes and hands are used during testing are considered alongside the intramodal tasks of Fujiki and Hishitani (2010, 2011), further possibilities for the mechanism of information transfer in cross-modal memory need to be envisioned. Thus, we can consider a mechanism different from the one proposed by Seemüller et al. (2011), in which information is always maintained in an input-modality-specific format.

In the study of spatial working memory, the most common format is a recall task in which only the eyes are used for encoding, and both the eyes and hands are used for testing. In this setup, the participants observe the locations of objects presented visually on the display and then reproduce those locations in order with their hands. Either an eye-movement or a hand-movement interference task is employed as the interference condition. The results consistently show interference effects for both eyes and hands. In other words, when both are used during testing, interference from both the eyes and the hands is observed, and when only the eyes are used during testing, interference arises only from the eyes, as reported by Fujiki and Hishitani (2010). This suggests that information may be maintained with reference to the output modality. Therefore, the maintained representation may be determined not only by the modality used during encoding, as Seemüller et al. (2011) argued, but also by the modality used during testing.

This review of the recent work of Seemüller et al. (2011) and Fujiki and Hishitani (2010, 2011) shows that at least modality-specific representations are being used to maintain the information. However, the question of how the modality specificity of the representations is determined in cross-modal memory remains to be discussed. Seemüller et al. (2011) proposed a mechanism according to which modality-specific representations related to the eyes and hands used for encoding are stored, regardless of whether they are cross-modal or intramodal. Conversely, Fujiki and Hishitani (2010, 2011) and spatial working memory studies together provide another possibility that the representations used during testing are maintained.

An overview of visual–haptic cross-modal memory research suggests that, in addition to modality-specific representations, modality-nonspecific representations may reinforce memory consolidation (e.g., Cattaneo & Vecchi, 2008). In other words, both modality-specific and modality-nonspecific representations may be responsible for information retention in cross-modal memory, and a role for modality-nonspecific representations cannot be dismissed. In contrast to this view, Lacey and Campbell (2006) suggested that there may in fact be several modality-specific representations, corresponding to the modalities used during encoding and testing, and that information may be retained in multiple formats. In their study, visual, tactile, or verbal interference effects did not occur in visual–haptic cross-modal memory. On this account, multiple modality representations function independently, so that if one is unavailable because of interference, retention is taken over by the other modality-specific representations and no interference effect appears. In other words, multiple modality-specific representations can retain information in coordination by assuming alternative functions without the intervention of modality-nonspecific representations. Because this theory is based on visual–haptic cross-modal memory and not on memory for the movement and position change of an object, which is the focus of this study, it is important to examine cross-modal memory involving eye and hand movements. In sum, previous studies offer several different theories, and the validity of any one of them for cross-modal memory involving the eyes and hands is unclear.

Therefore, we focus on recent research results and ask whether the modality specificity of the maintained representations is determined by the input or the output modality. To address this issue, this study constructed pure cross-modal memory tasks (eye–hand, hand–eye) and examined the effects of secondary tasks performed with the eyes and hands on memory performance, to elucidate how the information is maintained. Whereas Seemüller et al. (2011) examined only the interference effect, this study followed Fujiki and Hishitani (2010, 2011) in examining the rehearsal effect as well. The interference effect is the decline in memory performance caused by a secondary task that differs from the main task. In general, if memory performance does not decline, the representations used in the secondary task may not be involved in maintenance. However, as Lacey and Campbell (2006) noted, it is also possible that in cross-modal situations, eye and hand representations are used in a complementary manner, such that interference effects do not occur. If this interpretation is correct, there should be no interference effect but only a facilitation effect from voluntary rehearsal. As Fujiki and Hishitani (2010, 2011) repeatedly confirmed, the rehearsal effect makes it possible to directly identify the type of representation that is most effective for a given task, because participants rehearse the same information as that received in the main task with their eyes or hands. By combining the interference method with the rehearsal method, and thus testing for facilitation as well as interference effects, this study sought to obtain more reliable findings that can be interpreted in terms of these various possibilities.

The hypotheses to be tested are as follows:

  • Hypothesis 1: Representations based on input modality at encoding are retained as they are, and the conversion is effected at the time of the test.

  • Hypothesis 2: After encoding, information is transformed into the testing output modality and is retained in that form.

  • Hypothesis 3: Any representations based on the input and output modalities used in encoding and testing, respectively, are retained.
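One possible reading of these hypotheses is that each predicts which secondary-task modality (eyes, hands, or both) should produce rehearsal or interference effects in each cross-modal task. The sketch below encodes that reading only for illustration; the names `TASKS` and `predicted_effects` are hypothetical and do not come from the study.

```python
# Illustrative encoding of the three hypotheses as predictions about which
# secondary-task modality should affect memory performance.
# All names here are hypothetical and are not taken from the study.

TASKS = {
    "eye-hand": {"input": "eye", "output": "hand"},
    "hand-eye": {"input": "hand", "output": "eye"},
}


def predicted_effects(hypothesis: int, task: str) -> set:
    """Return the secondary-task modalities expected to show an effect."""
    modalities = TASKS[task]
    if hypothesis == 1:  # maintained in the input-modality format
        return {modalities["input"]}
    if hypothesis == 2:  # transformed into the output-modality format
        return {modalities["output"]}
    if hypothesis == 3:  # both input- and output-based formats maintained
        return {modalities["input"], modalities["output"]}
    raise ValueError("hypothesis must be 1, 2, or 3")


for h in (1, 2, 3):
    for task in TASKS:
        print(h, task, sorted(predicted_effects(h, task)))
```

Under this reading, an effect of both the eyes and the hands in a single cross-modal task would be expected only under Hypothesis 3.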

Experiments

Four experiments were conducted: Experiments 1A and 1B investigated the rehearsal and interference effects in the eye–hand cross-modal memory, and Experiments 2A and 2B investigated the same effects in the hand–eye cross-modal memory. The procedures for all these experiments were ethically approved by the Ethics Council of Hokusei Gakuen University.

Experiment 1A

In Experiment 1A, we investigated the rehearsal effect on eye–hand cross-modal memory. The optimal delay time for the rehearsal effect to occur in eye–hand cross-modal memory has not been determined, as only a few studies have investigated this effect. Two delay times were therefore set, following Fujiki and Hishitani’s (2011) work on the rehearsal effect in hand–hand intramodal memory: one for immediate recognition and the other for recognition after 10 s.

Method

Participants

The participants recruited for the experiment were 36 university students (12 male and 24 female participants; mean age = 19.97 years) who were all right-handed and reported normal vision. The sample size was determined by referring to the results of the rehearsal effects analysis confirmed by Fujiki and Hishitani (2010, 2011), and a power analysis was conducted using G*Power (Erdfelder et al., 1996) with the effect size set to ηp² = .14, power of 85%, and alpha of .05. The required sample size was 14 participants. We decided on N = 36 as the sample size, which adequately satisfies the criterion.
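G*Power expresses ANOVA effect sizes as Cohen's f, so a partial eta squared value is typically converted with f = sqrt(ηp² / (1 − ηp²)) before it is entered. The following minimal sketch shows only that conversion for ηp² = .14; the function name is hypothetical, and the sketch does not attempt to reproduce the reported sample-size calculation, which also depends on design settings not listed here.

```python
import math


def eta_p2_to_cohens_f(eta_p2: float) -> float:
    """Convert partial eta squared to Cohen's f: sqrt(eta / (1 - eta))."""
    if not 0.0 <= eta_p2 < 1.0:
        raise ValueError("partial eta squared must be in [0, 1)")
    return math.sqrt(eta_p2 / (1.0 - eta_p2))


# Effect size assumed in the power analysis described above.
f = eta_p2_to_cohens_f(0.14)
print(round(f, 3))  # about 0.40, conventionally a large effect
```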

Apparatus

A 17-inch cathode-ray tube display (NEC MultiSync X750) and a personal computer (PC) (Apple Power Macintosh G4) were used to present the stimuli. SuperLab Pro 1.74 (Cedrus Corp.) was used to control the experiment.

An original finger-movement device similar to the one employed by Fujiki and Hishitani (2011) was used to present the test stimuli (Fig. 1). This device was linked to a PC that controlled the experiment using a microcontroller. Therefore, as soon as the finger movement was completed, a signal was sounded for the next operation, and the series of experiments was automatically controlled by the PC. Figure 2a presents the arrangement of the participants, experimenter, finger-movement device, and display. As shown in Fig. 2b, the test stimulus presentation surface of the device lay in the same vertical plane as the display.

Fig. 1 Schematic illustration of the finger-movement device. Note. When the participant placed his or her finger on (a), the button switch on (e) was turned on, the motor on (f) rewound the rubber belt on (g), and (a), which was fixed by (b) and (c), moved; when (d) hit the lever switch on (h), the motor stopped, and the finger movement was completed

Fig. 2 Schematic illustration of the experiment. Note. a Overhead-view illustration. b Lateral-view illustration

Memory task

During encoding, a 0.5° black dot was presented at the center of the white display for 250 ms, followed by a blank screen for 250 ms and a dot of the same size and color for a further 250 ms at one of eight locations 35° from the vertical or horizontal axis (movement distance = 6.68° visual angle). The participants were asked to remember the movement direction of the dot presented at the center and the following presented dot (Fig. 3b).
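Stimulus sizes given in degrees of visual angle can be converted to on-screen lengths once a viewing distance is known. The sketch below illustrates the standard trigonometric conversion; the viewing distance of 60 cm is a purely hypothetical value chosen for illustration, since no viewing distance is reported in this section, and the function names are likewise assumptions.

```python
import math


def eccentricity_to_cm(angle_deg: float, distance_cm: float) -> float:
    """Offset from the fixation point that subtends angle_deg at the eye."""
    return distance_cm * math.tan(math.radians(angle_deg))


def extent_to_cm(angle_deg: float, distance_cm: float) -> float:
    """Size of an object centered on the line of sight subtending angle_deg."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


# HYPOTHETICAL viewing distance; not a value reported by the authors.
VIEWING_DISTANCE_CM = 60.0

shift_cm = eccentricity_to_cm(6.68, VIEWING_DISTANCE_CM)  # dot displacement
dot_cm = extent_to_cm(0.5, VIEWING_DISTANCE_CM)           # dot diameter
print(round(shift_cm, 2), round(dot_cm, 2))
```

With a different viewing distance, both lengths scale proportionally.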

Fig. 3 Target and test stimuli for the memory task. Note. a For hand movements. b For eye movements. c Test stimuli for the memory task. In mismatch trials, the test stimuli (a, b) were presented diagonally, 30° off from the target stimuli (a, b)

Immediately afterward, the participants rehearsed for 2 s to confirm the direction of the dots with their eyes or by moving their hands once or twice. The test stimuli were presented using sound as a cue. During the presentation of the test stimuli, the participants were asked to close their eyes, and the experimenter guided their index finger to the hand-movement start position of the device, which was placed on a vertical surface. When the participants lightly pressed on the start position, their fingers were automatically moved to the end position (the travel distance was approximately 7 cm, with an average movement time of 607 ms; see Footnote 1). The starting point of the finger movement was always the center of a circle with a 14-cm diameter (Fig. 3a). The end positions of the finger movements were set on the circumference, and the participants were asked to provide a recognition judgment as to whether the movement direction was the same as that of the previously presented dot. In the mismatch trials, the finger moved in a direction that was about 30° off from the encoding direction (Fig. 3c). To avoid semantic judgments, such as vertical versus horizontal, directions close to vertical or horizontal were excluded in the mismatch trials, and all diagonally moving directions were used. In the matched trials, the finger moved in the same direction as the dot presented at the beginning. For this reason, the participants were presented with all diagonal directions in both encoding and testing. In one block, eight directions were assigned in the matching and mismatching trials, for a total of 16 trials. The order of trials in the block was randomized. The participants’ hands and equipment were hidden by a cover.

Rehearsal conditions

In the rehearsal conditions, the participants confirmed the direction of the black dots by moving their eyes or hands once or twice over a 2-s period, immediately after the black dots were sequentially displayed.

The eye-rehearsal condition required the participants to lightly close their eyes once, using a sound as a cue, and then open them to observe where the dots had moved by moving their eyes. Pearson and Sahraie (2003) reported that when the participants were asked to look freely at a white screen that presented nothing to them during the period of a delay, they performed no active rehearsal work, such as moving their eyes, and no interference or facilitation effects occurred. In that experiment, the lack of a visual target on the screen made it difficult for the participants to properly move their eyes, which may have prevented the eye-rehearsal effect. In this study, we presented a circle to assist the participants in the rehearsal process to enable them to actively move their eyes. Specifically, a 5-mm-wide gray circle with a visual angle of 9.6° was drawn at the center of the display. With this circle as a reference, the participants were asked to reproduce the movement direction of the previously presented dots by moving their eyes.

Further, the hand-rehearsal condition required the participants to close their eyes at a sound cue and move their right index finger to trace where the dots that they had seen previously had moved. During this rehearsal process, they were instructed to move their index finger in the air along the same vertical plane as the display. The participants were asked to confirm the movement of the dots mainly by moving their wrists, with their elbows on the desk and their index fingers extended. In contrast to the eye-rehearsal condition, no reference circle was used. This was done because preliminary investigations indicated that, if a reference circle were used, the participants’ attention would be drawn to finding the circle with their finger, which would interfere with confirming the movement direction.

In the no-rehearsal condition, a randomly selected number from one to nine was presented audibly, and the participants were required to say the next number (e.g., “two” if they heard “one”) as quickly as possible; this was repeated five times, so that no active spatial rehearsal took place. The numbers were presented in a prerecorded female voice once every 500 ms through the speakers of a PC.

The no-rehearsal condition was intended as a control condition to suppress attention and the articulatory system. This was done because the level of attention and the influence of verbal memory cannot be ruled out when participants do nothing (e.g., Awh et al., 1998; Lacey & Campbell, 2006), and the control condition was designed to exclude these effects.

Eye movement measurement via electrooculography (EOG)

The horizontal component of eye movement was recorded via electrooculography to confirm that eye movements were performed in the eye-rehearsal condition and not in the other rehearsal conditions. Before the experiment, the participants’ skin was rubbed with a pretreatment agent (Nihon Kohden, SkinPure), and disposable electrodes (Nihon Kohden, Vitrode F, F-150S) were attached to the outer corners of both eyes and the center of the forehead. After the electrodes were applied, skin resistance was confirmed to decrease to 30–50 kΩ. Horizontal eye movements were recorded during the practice trials, and each rehearsal task was reviewed for appropriateness before the main trial was started. Figure 4 presents a typical example of eye movements during each rehearsal task for a horizontally moving practice stimulus. As shown in this figure, the rehearsal task was practiced until it was confirmed that horizontal eye movements occurred in the eye-rehearsal condition and did not occur in the hand-rehearsal or no-rehearsal condition. Participants practiced the rehearsal task alone (right and left directions) and the combined memory task (memory stimulus [right and left directions] × judgment [matching and mismatching trials]) for four trials under the no-rehearsal, eye-rehearsal, and hand-rehearsal conditions. The experimenter monitored whether the eyes stayed still in the conditions in which they needed to stay still and whether they moved correctly in the condition in which they needed to move. The participants practiced until they were able to do so correctly and to judge correctly whether they had accomplished the task.

Fig. 4 Experiment 1A: Example of horizontal electrooculography trace in each rehearsal condition

Procedure

The participants sat on a chair with their chin on the chinrest and their right elbow on the table. First, they practiced moving their fingers using the hand-movement device to become accustomed to the setup. The vertical (top and bottom) and horizontal (left and right) directions were used for the practice period, but they were not used in the main trial. At a sound cue, the experimenter guided the participants’ index finger onto the finger rest at the origin point of the movement, and the participants pushed their finger onto this point to start the movement. The participants were required to keep their fingers on the finger rest until the finger movement stopped and to remove their fingers from the finger rest after completing the movement. They practiced this until they could smoothly move their fingers to the end position in any direction without being dislodged from the device. In addition, the distance between the elbow position on the table and the device, the height of the chinrest, and the position of the chair were adjusted for each participant to avoid putting an unreasonable burden on their finger and wrist movements. The main trials were conducted with the adjusted settings obtained during the practice session.

After the participants practiced moving their fingers using the hand-movement device, practice trials were conducted for each rehearsal condition, either with the rehearsal task only or with a combination of memory and rehearsal tasks. In practice trials that combined the memory and rehearsal tasks, the participants were required to observe and remember the movement of the dots, which was presented with a sound cue. For the horizontal practice stimuli, the eye movements were monitored using EOG, and it was confirmed that the eye movements occurred in accordance with the movement of the dots. When the experimenter observed that the eyes were not moving, the participants were instructed to follow the dots with their eyes.

One of the three secondary tasks was then performed for 2 s, again cued by a sound. In the immediate condition, the finger was moved for recognition immediately after this. In the 10-s delay condition, the participants first reported, as quickly as possible, the number following each aurally presented number for 10 s (10 iterations), as in the no-rehearsal condition, before recognition. The test stimuli were presented with the participants’ eyes closed. First, the experimenter guided the participants’ index finger onto the finger rest at the starting position of the movement when cued by a sound. The movement started as soon as the participants pressed on the spot; after their finger reached the end point, they were asked to verbally answer either “yes” or “no” in response to whether the movement direction was the same as that of the dot that they had seen previously. The sequence of this trial is illustrated in Fig. 5a.

Fig. 5 Trial timing combining eye–hand cross-modal memory task and secondary task. Note. a Combination of eye–hand memory task and rehearsal task. b Combination of eye–hand memory task and interference task

The participants’ responses were fed back to them in the practice trials. For the horizontal practice stimuli, the eye movements were monitored by EOG, and it was confirmed that the eyes did not move with the movement of the fingers. The main trial was then conducted in the same manner as the practice trial, but no feedback was given.

The combinations of rehearsal conditions and delay times were arranged in a total of six blocks. Each block was made up of 16 trials for a total of 96 trials. Each block was preceded by a rehearsal-only practice and a practice trial that combined a memory task and rehearsal. In the practice trials, the same eight directions as those in the main trial were presented once preceding each block, clockwise from the top right. One of the two directions per quadrant was assigned to a matching trial and the other to a mismatching trial; the order of the practice trials was the same for all blocks. A total of three sessions were conducted, where one session consisted of two blocks, one immediate and one delayed, for a given rehearsal condition. The order of the rehearsal conditions and of the delay times within each session was counterbalanced. Each session was followed by a 5-min break, and the experiment took approximately 2 h.
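The block structure described above (3 rehearsal conditions × 2 delay times, each block containing 8 directions × 2 trial types in random order) can be sketched as follows. This is a minimal illustration under stated assumptions: directions are represented only by the indices 0 to 7, the random seed is arbitrary, and counterbalancing of condition order across participants is not implemented. All names are hypothetical.

```python
import itertools
import random

REHEARSAL = ["eye", "hand", "none"]
DELAY = ["immediate", "10s"]
DIRECTIONS = range(8)                # indices standing in for the 8 directions
TRIAL_TYPES = ["match", "mismatch"]


def make_block(rehearsal, delay, rng):
    """Build one 16-trial block: every direction once per trial type."""
    trials = [
        {"rehearsal": rehearsal, "delay": delay, "direction": d, "type": t}
        for d, t in itertools.product(DIRECTIONS, TRIAL_TYPES)
    ]
    rng.shuffle(trials)  # trial order is randomized within the block
    return trials


rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
blocks = [make_block(r, d, rng) for r, d in itertools.product(REHEARSAL, DELAY)]
n_trials = sum(len(block) for block in blocks)
print(len(blocks), n_trials)  # 6 blocks, 96 trials
```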

Results and discussion

A two-way repeated-measures analysis of variance (ANOVA) of rehearsal condition (eyes, hands, and none) × delay time (immediately and 10 s later) was conducted on the percentage of correct responses. The results indicated significant main effects of rehearsal condition, F(2, 70) = 5.34, p < .01, ηp² = .13, and delay time, F(1, 35) = 5.35, p < .05, ηp² = .13. The interaction between delay time and rehearsal condition was not significant, F(2, 70) = 1.02, n.s. Multiple comparisons of the rehearsal conditions using the Ryan method revealed that the eye-rehearsal, t(70) = 2.45, p < .05, d = .33, and hand-rehearsal, t(70) = 3.10, p < .005, d = .42, conditions had significantly higher percentages of correct responses than the no-rehearsal condition; no significant difference was observed between the eye- and hand-rehearsal conditions, t(70) = 0.69, n.s. (Fig. 6). For the main effect of delay time, the correct response rate was higher for immediate recognition (M = 85.41%, SD = 12.13) than for recognition after 10 s (M = 82.61%, SD = 12.84). Table 1 presents the correct response rates for all conditions.
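As a quick consistency check on the reported effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the two main effects reported above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


rehearsal_effect = partial_eta_squared(5.34, 2, 70)   # main effect of rehearsal condition
delay_effect = partial_eta_squared(5.35, 1, 35)       # main effect of delay time

print(round(rehearsal_effect, 2), round(delay_effect, 2))  # 0.13 0.13
```

Both values agree with the reported ηp² = .13.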

Fig. 6 Experiment 1A: Mean percentage of correct responses in each rehearsal condition for the eye–hand cross-modal memory task. Note. The error bars represent standard errors

Table 1 Experiment 1A: Mean percentage of correct responses (and standard deviations) in each rehearsal condition for the eye–hand cross-modal memory task

Regardless of the delay time, the eye- and hand-rehearsal conditions produced significantly higher percentages of correct responses than the no-rehearsal condition, indicating that both exerted a facilitating effect. This indicates that both the eyes and hands are involved in the rehearsal function for the cross-modal memory task, where the eyes are used for encoding and the hands for testing.

Experiment 1B

In Experiment 1B, we investigated the interference effect in the eye–hand cross-modal memory. Fujiki and Hishitani (2010) found that eye rehearsal had a selective facilitation effect on an eye–eye intramodal memory task. When the interference effect was examined using the same intramodal task, selective interference by the eye was likewise observed. Similar selective facilitation and interference effects were observed by Fujiki and Hishitani (2011), who examined hand-to-hand intramodal memory. The results of these two studies indicate that rehearsal and interference effects can be expected to follow the same pattern. Therefore, if both eye and hand representations are used, as the results of Experiment 1A suggest, we would expect to observe interference effects for both the eyes and the hands.

However, for the cross-modal tasks, there is another possibility. When a task involves a single representation, as in the intramodal tasks, facilitation and interference produce corresponding effects; when a task involves multiple representations, however, interference effects may fail to occur. Lacey and Campbell (2006) compared four conditions of a visual/tactile cross-modal memory task: control, visual interference, verbal interference, and tactile interference. They found that visual and verbal interference reduced memory performance for unfamiliar objects relative to the control condition, whereas for familiar objects, performance was similar to that in the control condition. This suggests that a network of different representations may be formed for familiar objects, although it is also possible that performance for familiar objects is simply less likely to decline. In the present study, task characteristics must likewise be considered if the interference effect disappears; nevertheless, because we use a memory task similar to that in Experiment 1A, where both eye- and hand-rehearsal effects were obtained, the possibility that a network of different representations is formed can be tested to a considerable extent.

Therefore, in Experiment 1B, we will examine which of the following two hypotheses applies.

  1. If both eye- and hand-interference effects occur, then both the input-modality and output-modality representations are involved in retention.

  2. If no interference effect occurs for either the eye or the hand, then the network formed by the representations of the eyes and hands functions in a complementary manner.

Note that in both hypotheses the representations of input and output modalities are used. However, they differ on whether input and output modalities function cooperatively (Hypothesis 1) or independently (Hypothesis 2). If they function cooperatively, the modality-unspecific account may need to be reconsidered because the results would resemble those obtained in spatial working memory studies. Conversely, the hypothesis of independent functioning will require further discussion, including of alternative explanations, because there are few prior findings and insufficient comparative data. Nevertheless, the interference paradigm is a technique that has yielded stable results in working memory research. Therefore, in Experiment 1B, the same experimental environment and stimuli as in Experiment 1A will be used so that the results can be discussed in relation to the rehearsal effects.

In Experiment 1A, it was unclear whether the effects of the rehearsal task would be reflected immediately or only after 10 s; thus, two delay times were used: immediate recognition and recognition after 10 s. The results showed that rehearsal had the same effect at both delay times; no interaction was observed between the delay time and the rehearsal condition. For this reason, we used only a single 5-s delay in the next experiment on the interference effect, reasoning that the delay time would not change the effects of the condition. To determine whether executing the secondary task contributes to the deterioration of information, a delay time of 5 s, which allows for the process of information deterioration, was deemed appropriate. In the eye–eye and hand–hand intramodal studies of Fujiki and Hishitani (2010, 2011), the same delay time was used for the interference task, and an interference effect was observed.

Method

Participants

The participants were 18 college students (eight males and 10 females, mean age = 20.44 years) who were right-handed and reported normal vision. The sample size was determined with reference to the interference effects reported by Fujiki and Hishitani (2010, 2011): a power analysis was conducted using G*Power (Erdfelder et al., 1996) with the effect size set to ηp² = .24, power to 85%, and alpha to .05. The required sample size was eight participants. We therefore set the sample size to N = 18, which comfortably satisfies this criterion.
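G*Power takes Cohen's f as the effect size for F tests, so a value expressed as partial eta squared is first converted with the standard relation f = sqrt(ηp² / (1 − ηp²)). The sketch below shows only this conversion for the stated ηp² = .24; the remaining G*Power settings (for example, the assumed correlation among repeated measures) are not given in the text and are not reproduced here.

```python
import math


def cohens_f_from_partial_eta_squared(eta_p2):
    """Convert partial eta squared to Cohen's f (standard conversion)."""
    if not 0.0 <= eta_p2 < 1.0:
        raise ValueError("partial eta squared must lie in [0, 1)")
    return math.sqrt(eta_p2 / (1.0 - eta_p2))


effect_size_f = cohens_f_from_partial_eta_squared(0.24)
print(round(effect_size_f, 3))  # 0.562
```

By Cohen's conventional benchmarks (f = .40 for a large effect), this corresponds to a large effect, which is why a small required sample results.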

Apparatus

The same apparatus was used as in Experiment 1A.

Memory task

The same memory task was used as in Experiment 1A.

Interference conditions

In all conditions, the participants performed the secondary task during a 5-s delay and were required to close their eyes during the delay to enable them to concentrate on the task.

In the eye-interference condition, the participants were asked to move their eyes in such a way as to draw a square with their gaze. During practice, the participants visually followed a gray dot (0.8° visual angle in diameter) as it moved counterclockwise from the upper right corner of the display to trace out a square (9.6° visual angle per side). They were then asked to close their eyes and move their eyes in the same way in the same period of time. This practice sequence was repeated twice. In the main trials, instead of following the dot, the participants moved their gaze with their eyes closed and on their own but in the same manner as in the practice.
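The path of the practice dot can be sketched as a list of corner waypoints. The coordinates below are in degrees of visual angle relative to the center of the square, with y pointing upward; where the square sits on the display and how fast the dot moves are not specified here, so both are left out, and the function names are illustrative.

```python
SIDE_DEG = 9.6  # side length of the square in degrees of visual angle


def square_corners(side_deg=SIDE_DEG):
    """Corner waypoints of a counterclockwise square trace that starts and
    ends at the upper right corner (coordinates relative to the square's center)."""
    half = side_deg / 2
    return [
        (half, half),     # upper right (start)
        (-half, half),    # upper left
        (-half, -half),   # lower left
        (half, -half),    # lower right
        (half, half),     # back to upper right
    ]


def signed_area(path):
    """Shoelace formula; a positive value means the path runs counterclockwise."""
    return sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in zip(path, path[1:])) / 2


corners = square_corners()
print(corners[0], "->", corners[1], "area sign:", signed_area(corners) > 0)
```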

In the hand-interference condition, the participants were asked to draw a square on the same vertical surface as the display, placing their elbows on the desk and with their right index fingers extended, moving their wrists as widely as possible. During the practice, after watching the experimenter draw a square with their index finger, moving counterclockwise, starting from the upper right, the participants were asked to draw a square with their own index finger with their eyes held closed. This practice sequence was also repeated twice. During the main trial, the participants were required to move their hands as they had done in practice.

In the no-interference condition, the participants performed the same number-chanting task as in the no-rehearsal condition of Experiment 1A and were required to verbally respond five times to the next number they heard.

Eye movement measurement via electrooculography

The eye movements were recorded in the same way as in Experiment 1A. When presenting the horizontal practice stimuli, we confirmed that all participants performed eye movements in the eye-interference condition and no eye movements in the other secondary tasks. We then proceeded to the main trial.

Procedure

After two black dots were visually presented in sequence, as in Experiment 1A, the participants performed one of the three interference conditions for 5 s with their eyes closed. As in Experiment 1A, the participants were then required to report their recognition judgment verbally. The sequence of this trial is illustrated in Fig. 5b.

The experiment was conducted in three blocks of 16 trials each for the interference condition; in total, 48 trials were conducted, which formed the main trial. The order of the blocks was counterbalanced. Practice trials combining the memory and interference tasks were conducted as in Experiment 1A. The time required for the experiment was approximately 1.5 h.

Results and discussion

A one-way repeated-measures ANOVA revealed no significant main effect of the interference condition on the percentage of correct responses, F(2, 17) = .04, p > .10, ηp² < .01 (Table 2). Further, BF10 = .05. A Bayes factor expresses the strength of evidence as the ratio by which the data favor the alternative hypothesis over the null hypothesis: values greater than 3 provide evidence for the alternative hypothesis, whereas values less than 1/3 provide evidence for the null hypothesis (Jeffreys, 1961; Kass & Raftery, 1995). That is, neither the eye- nor the hand-interference condition exerted an interference effect that reduced memory performance relative to the no-interference condition. If only interference had been examined, without also examining rehearsal, the absence of an interference effect would generally be taken to indicate that neither the eyes nor the hands contribute to the rehearsal function. However, the rehearsal effects obtained with the same memory task (Experiment 1A) confirmed that two types of representation, input-modality-specific and output-modality-specific, have information retention functions in eye–hand cross-modal memory. Taking this result into account, the disappearance of the interference effect in Experiment 1B may relate to the mechanism identified by Lacey and Campbell (2006). In other words, when an interference task is performed with the eyes, which are used for information input, the output-modality-specific representations associated with the hands may be responsible for information retention. Similarly, when an interference task is performed with the hands, which are used for information output, the input-modality-specific representations associated with the eyes may be responsible for information retention. As a result, the originally expected decline in memory performance may not have occurred.
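The decision rule for Bayes factors cited above (greater than 3 favors the alternative hypothesis, less than 1/3 favors the null) can be written as a small helper. This is only an illustration of the thresholds as stated in the text; the function name and the "inconclusive" label for the middle range are assumptions, and the computation of BF10 itself is not shown.

```python
def interpret_bf10(bf10, threshold=3.0):
    """Classify a Bayes factor BF10 using the 3 and 1/3 cutoffs cited in the text."""
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf10 > threshold:
        return "evidence for the alternative hypothesis"
    if bf10 < 1.0 / threshold:
        return "evidence for the null hypothesis"
    return "inconclusive"  # label assumed for the range between 1/3 and 3


bf10 = 0.05          # value reported for the interference condition
bf01 = 1.0 / bf10    # the same evidence expressed in favor of the null
print(interpret_bf10(bf10), "; BF01 =", round(bf01, 1))
```

Expressed the other way around, BF10 = .05 means that the data are about 20 times more likely under the null hypothesis than under the alternative.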

Table 2 Experiment 1B: Mean percentage of correct responses (and standard deviations) in each interference condition for the eye–hand cross-modal memory task

However, the validity of this mechanism must be examined more closely. In Experiment 1, we applied a setting in which the eyes were used for encoding and the hands for testing, but it was unclear whether the same results could be obtained for the hand–eye cross-modal memory, where the two are reversed. Therefore, in Experiment 2, we conducted two types of experiments on the hand–eye cross-modal memory. In particular, we examined whether both the input-modality-specific and output-modality-specific representations contribute to the retention of information.

Experiment 2A

In Experiment 2A, we examined whether the hand–eye cross-modal memory produced the same results as the rehearsal effect in Experiment 1A. To examine the rehearsal effect, the delay times were set to 0 and 10 s, as in Experiment 1A.

Method

Participants

The participants were 24 college students (nine males and 15 females, mean age = 20.5 years) who were right-handed and reported normal vision. Because sufficient rehearsal effects were obtained in Experiment 1A, we judged that rehearsal effects could be verified with a smaller sample. The sample size was set to N = 24, which adequately satisfies the same criteria as in Experiment 1A and is the size adopted by Fujiki and Hishitani (2011).

Apparatus

The same apparatus was used as in Experiment 1A.

Memory task

All memory stimuli were presented with the participants’ eyes closed. At the beginning of the trial, the experimenter guided the participants’ index finger to the finger rest at the starting position of the finger-movement device. When the participants lightly pressed the finger rest, their finger was automatically moved to the end position. The starting position of this finger movement was always the center of a circle with a diameter of 14 cm. The end position was one of eight locations arranged on the circumference and 35° from the horizontal and vertical axes (Fig. 3a). The participants were asked to remember the entire movement direction from the start to the end position. Immediately after this, the participants rehearsed the movement direction by moving their eyes or hands once or twice for 2 s. The test stimuli were presented after a sound cue, at which the participants opened their eyes. A black dot subtending 0.5° of visual angle was presented at the center of a white display for 250 ms, followed by a blank screen for 250 ms, after which the same dot was presented on the circumference for 250 ms. The participants judged whether the movement direction of the dot was the same as that of the previously presented finger movement. In mismatch trials, the dot moved in a direction approximately 30° away from the encoded direction. In one block, the eight directions were each assigned to matching and mismatching trials, for a total of 16 trials. The order of the trials was randomized within each block. The participants’ hands and the device were covered by a shroud.
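The stimulus geometry and block composition can be sketched as follows. Several details are assumptions made for illustration: the eight end points are read as two per quadrant, one 35° from the horizontal axis and one 35° from the vertical axis; each direction is used once as a match and once as a mismatch trial; and the sign of the 30° mismatch is chosen at random, which the text does not specify.

```python
import math
import random

RADIUS_CM = 7.0        # circle with a diameter of 14 cm
OFFSET_DEG = 35.0      # angular offset from the nearest axis
MISMATCH_DEG = 30.0    # approximate deviation in mismatch trials


def end_point_angles():
    """Eight directions, two per quadrant (assumed reading of the layout)."""
    angles = []
    for quadrant_start in (0, 90, 180, 270):
        angles.append(quadrant_start + OFFSET_DEG)        # 35° past the axis that opens the quadrant
        angles.append(quadrant_start + 90 - OFFSET_DEG)   # 35° before the axis that closes it
    return angles


def end_point_cm(angle_deg):
    """End position on the circumference, relative to the circle's center."""
    rad = math.radians(angle_deg)
    return (RADIUS_CM * math.cos(rad), RADIUS_CM * math.sin(rad))


def build_block(rng):
    """One block: each direction once as a match and once as a mismatch trial."""
    trials = []
    for angle in end_point_angles():
        trials.append({"encoded_deg": angle, "test_deg": angle, "match": True})
        shift = rng.choice((-MISMATCH_DEG, MISMATCH_DEG))  # sign is an assumption
        trials.append({"encoded_deg": angle, "test_deg": (angle + shift) % 360, "match": False})
    rng.shuffle(trials)  # trial order randomized within the block
    return trials


block = build_block(random.Random(0))
print(end_point_angles())
print(len(block), "trials,", sum(t["match"] for t in block), "matches")
```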

Rehearsal conditions

In the rehearsal conditions, immediately after the finger movement, the participants were asked to confirm the movement direction with their eyes or hand once or twice over a period of 2 s. All rehearsal conditions were the same as in Experiment 1A.

Eye movement measurement via electrooculography

The eye movements were recorded in the same way as in Experiment 1A. When we presented the horizontal practice stimuli, we confirmed that all experimental participants performed eye movements in the eye-rehearsal condition and no eye movements in the other secondary tasks. We then proceeded to the main trial.

Procedure

The procedure was performed in the same manner as in Experiment 1A.

Results and discussion

A two-way repeated-measures ANOVA of rehearsal condition (eyes, hands, none) × delay time (immediately and 10 s later) was conducted on the percentage of correct responses. Significant effects were observed for the rehearsal condition, F(2, 46) = 7.04, p < .005, ηp² = .23, and delay time, F(1, 23) = 5.84, p < .05, ηp² = .08. The interaction between the delay time and rehearsal condition was nonsignificant, F(2, 46) = .80, n.s. Multiple comparisons using the Ryan method revealed that the eye-rehearsal, t(46) = 3.72, p < .001, d = .57, and hand-rehearsal, t(46) = 2.31, p < .05, d = .34, conditions had significantly higher percentages of correct responses than the no-rehearsal condition. No significant difference was observed between the eye- and hand-rehearsal conditions, t(46) = 1.41, n.s. (Table 3, Fig. 7).

Table 3 Experiment 2A: Mean percentage of correct responses (and standard deviations) in each rehearsal condition for the hand–eye cross-modal memory task

Fig. 7 Experiment 2A: Mean percentage of correct responses in each rehearsal condition for the hand–eye cross-modal memory task. Note. Error bars represent standard errors

As in Experiment 1A, the eye- and hand-rehearsal conditions produced significantly higher percentages of correct responses than the no-rehearsal condition regardless of the delay time, confirming a facilitation effect of both the eyes and the hands. Thus, both the input-modality-specific representations that are closely related to the hand and the output-modality-specific representations that are closely related to the eyes facilitate information retention in a cross-modal spatial memory task that uses the hand for encoding and the eyes for testing.

Experiment 2B

In Experiment 2B, we investigated whether the same interference effect as in Experiment 1B could be obtained for hand–eye cross-modal memory. As in Experiment 1B, we set the delay time to only 5 s to investigate the interference effect.

Method

Participants

The participants were 18 college students (seven males and 11 females, mean age = 20.0 years) who were right-handed and reported normal vision. The sample size was set to satisfy the same criteria as in Experiment 1B.

Apparatus

The same apparatus as in Experiment 1A was used.

Memory task

The same task as in Experiment 2A was used.

Interference conditions

All interference conditions were as in Experiment 1B.

Eye movement measurement via electrooculography

Eye movements were recorded as in Experiment 1A. When we presented horizontal practice stimuli, we confirmed that all experimental participants performed eye movements in the eye-interference condition and no eye movement in the other secondary tasks. We then proceeded to the main trial.

Procedure

The procedure was performed in the same manner as in Experiment 1B.

Results and discussion

A one-way repeated-measures ANOVA on the percentage of correct responses showed no significant main effect of the interference condition, F(2, 17) = 1.46, p > .10, ηp² = .04 (Table 4); further, BF10 = .13, and because BF10 < 1/3, this provides evidence for the null hypothesis (Jeffreys, 1961; Kass & Raftery, 1995). This result is similar to that of Experiment 1B, where no interference effect was produced by either the eye or the hand. Therefore, as in Experiment 1B, in the hand–eye cross-modal memory task, both the modality-specific representations used for encoding and those used for testing appear to have played a role in information retention. Because these representations can be independently responsible for maintaining information, a complementary relation may exist between them: if one of them cannot retain the information, the other takes over.

Table 4 Experiment 2B: Mean percentage of correct responses (and standard deviations) in each interference condition for the hand–eye cross-modal memory task

Difficulty comparison of eye–hand and hand–eye cross-modal memory task

Previous cross-modal memory studies have discussed the asymmetry between eye–hand and hand–eye memory performance (e.g., Connolly & Jones, 1970; Seemüller et al., 2011). Therefore, after examining the differences in memory performance between the eye–hand and hand–eye cross-modal tasks, we consider their implications for the rehearsal and interference effects.

In this study, a between-subjects two-factor analysis of variance of memory task (eye–hand and hand–eye) × delay (0 s, 5 s, 10 s) revealed only a marginally significant main effect, F(2, 150) = 3.82, p < .10, ηp² = .03. Memory performance was higher for eye–hand than for hand–eye. For this analysis, we excluded the interference and rehearsal conditions and used only the none condition, in which only spoken responses were made during the delay, so that the comparison reflects the effect of the memory task alone.

Among studies comparing cross-modal memory, some have shown that eye–hand has higher memory performance than hand–eye (Diewert & Stelmach, 1977; Posner, 1967), whereas others have shown the opposite (Connolly & Jones, 1970; Jones & Connolly, 1970; Marteniuk & Rodney, 1979). Furthermore, some studies have shown comparable results for eye–hand and hand–eye cross-modal memory performance (Marteniuk & Rodney, 1979; Newell et al., 1979; Seemüller et al., 2012). Marteniuk and Rodney (1979) found that eye–hand and hand–eye memory performance was equivalent when a completely darkened room prevented cueing by ambient visual information, whereas hand–eye had better memory performance than eye–hand when a dimly lit room provided such cueing. Furthermore, when hand movement is passive, hand–eye has better memory performance (Connolly & Jones, 1970; Jones & Connolly, 1970; Marteniuk & Rodney, 1979), but when hand movement is active, eye–hand has better memory performance (Diewert & Stelmach, 1977). Thus, the influence of the experimental environment and task characteristics makes it difficult to obtain stable results regarding the asymmetry of the memory task. Therefore, regarding the cross-modal memory task in our study, although it is conceivable that the eye–hand task may have been easier to memorize than the hand–eye task, we must be cautious in considering it a result that reflects the mechanism of the retention process.

To avoid relying on these unstable memory task difficulty results, we focused on the rehearsal effect. The rehearsal effects showed no asymmetry: eye rehearsal was not more effective for eye–hand than for hand–eye cross-modal memory, nor was hand rehearsal. The rehearsal effect was similar for eye–hand and hand–eye cross-modal memory, suggesting that differences in the difficulty of the memory task are unlikely to have affected the rehearsal and interference effects. Therefore, regardless of the asymmetry between eye–hand and hand–eye memory tasks, eye and hand rehearsal can be considered to function equally well in the retention process. In this study, the rehearsal process is examined in terms of the rehearsal and interference effects produced by the secondary task while taking task asymmetry into account.

Discussion

In this study, we focused on the representations maintained to understand part of the mechanism of cross-modal memory that involves the eyes and hands. In particular, we sought to determine whether the specific modality of the maintained representations in the eye–hand and hand–eye cross-modal memory could be defined by input or output modality.

First, in the case of eye–hand cross-modal memory, both the eye- and hand-rehearsal tasks produced an identical level of facilitation (Experiment 1A). The fact that we obtained such effects through both the eyes and hands excludes the possibility that we necessarily maintain only one type of modality-specific representation, either input-modality-specific representations during encoding or output-modality-specific representations during testing. Instead, we examined the possibility that both input-modality-specific and output-modality-specific representations can be maintained.

In Experiment 1B, we examined interference effects. Had both eye- and hand-interference effects occurred, this would have suggested that representations for both the eye and the hand were playing a role, as in Experiment 1A. However, neither eye- nor hand-interference effects were observed. Because Experiment 1B used the same task as Experiment 1A, it is difficult to conclude that neither the eye nor the hand was used. When the interference task disturbs input-modality-specific representations (eye), it is reasonable to assume that output-modality-specific representations (hand) were responsible for information retention. Likewise, when information retention by output-modality-specific representations (hand) is disturbed, input-modality-specific representations (eye) may be responsible for information retention. However, as there have been very few reports on the disappearance of interference effects in previous studies, further verification is necessary.

Experiment 2 examined the hand–eye cross-modal memory to determine whether the same results as in Experiment 1 could be obtained. The results were similar to those for the eye–hand cross-modal memory described earlier (Experiments 1A and 1B): in terms of the rehearsal effect, a facilitation effect was detected for both the eyes and hands (Experiment 2A); furthermore, no interference effect was observed for either the eyes or the hands (Experiment 2B). The same mechanism thus appears to operate in both eye–hand and hand–eye cross-modal memory. This consistency in the results indicates that, in cross-modal memory that involves the eyes and hands, modality-specific representations in encoding and testing contribute to information retention.

Contrary to these results, Seemüller et al. (2011) reported a selective interference effect due to modality at encoding. However, a careful review of both settings allows us to draw consistent implications. Seemüller et al. (2011) presented their findings on the combined performance of both intramodal and cross-modal tasks, whereas the present study analyzed only cross-modal memory tasks. In other words, task performance in the earlier study was affected by intramodal factors, whereas the present study did not include these factors. Notably, the results regarding the interference effects may have differed between the intramodal and cross-modal tasks. Fujiki and Hishitani (2010, 2011) reported selective interference effects of the eyes in intramodal memory tasks requiring only the eyes and of the hands in intramodal memory tasks requiring only the hands. However, in this study, the interference effect vanished in the cross-modal memory. Thus, the merging of cross-modal and intramodal data in Seemüller et al. (2011) may reflect the selective interference effects in intramodal memory tasks identified by Fujiki and Hishitani (2010, 2011). Therefore, Seemüller et al.’s findings on selective interference effects should be reexamined with the task settings distinguished between intramodal and cross-modal.

In summary, we propose a new cross-modal mechanism of memory involving the eyes and hands. Two processes are involved: the first maintains the input information in its original form as an input-modality-specific representation, and the second maintains it after it has been transformed into an output-modality-specific representation. This proposal refines the hypotheses put forward in previous studies.

Previous cross-modal memory studies have compared eye–hand and hand–eye memory performance and used asymmetric results to estimate the rehearsal process (e.g., Connolly & Jones, 1970). In this study, eye–hand tended to perform better than hand–eye. However, the asymmetry of these memory tasks varies depending on the experimental environment and task characteristics and is poor empirical evidence for proposing a generalizable mechanism. Therefore, rather than focusing on the unstable asymmetry of memory tasks, this study discussed a more generalizable mechanism for the rehearsal process by examining the effects of secondary tasks on memory performance. Results indicated that the same pattern of rehearsal effect was consistently obtained for eye–hand and hand–eye, regardless of the asymmetry in memory task performance. This indicates that examining the rehearsal effect using the dual-task method is useful. Furthermore, the results suggest that the modality-specific representations at input and output may function equally in forming the representation format best suited to the task.

In addition, Fujiki and Hishitani (2010) found that in the eye–eye intramodal memory, rehearsing information input by the eye with the hand had no effect; only rehearsal by the eye exerted a facilitating effect. Likewise, in hand–hand memory, the only facilitating effect was that of hand rehearsal (Fujiki & Hishitani, 2011). However, for the cross-modal memory using the hand as the output modality, as in this study, rehearsal effects were obtained for both the eye and the hand. This indicates that, as proposed by Seemüller et al. (2011), information input from the eyes may not only be retained as eye representations but can also be transformed into hand representations (Connolly & Jones, 1970), depending on the modality used for output. In other words, where the modality to be used at the time of output is different, it is useful to maintain the transformed representations.

Brain imaging studies examining cross-modal perception in the eyes and hands have demonstrated that the anterior intraparietal sulcus is responsible for the transformation of cross-modal spatial information (Tanabe et al., 2005). Although both input- and output-modality-based retention require transformation, the details of the transformation regarding information retained in short-term memory remain unclear. Few details are known regarding when and how transformation can be performed in retaining information. Seemüller et al. (2012) measured EEG for each stage of encoding, delay, and recognition, and they found that the alpha power was lower in the cross-modal than in the unimodal recognition, suggesting a direct interaction between the eye and hand in recognition. However, the results of this study indicate that rehearsal effects were obtained even when the transformation was immediately performed after encoding, suggesting that the transformation can be performed, not only during recognition but also during encoding. The process of transformation and retention of the representation of output modality immediately after encoding, an outcome of this study, requires further investigation.

Next, we observed the consistent disappearance of interference effects in the interference task experiments. Lacey and Campbell (2006) provided a positive interpretation of this phenomenon for familiar objects, observing that visual and verbal interference effects on recognition performance were obtained for unfamiliar objects. Additionally, working memory studies involving vision and touch have revealed the possibility that separate modality-specific memory systems hold information in parallel (Katus & Eimer, 2018, 2020). Considering these research results, it is possible that, in this study, systems with two types of representations, eye and hand, existed in parallel and performed complementary functions.

In previous cross-modal studies, input-modality-specific representations (Seemüller et al., 2011), output-modality-specific representations (Connolly & Jones, 1970), and supramodal mechanisms (Cattaneo & Vecchi, 2008) have been proposed. These hypotheses do not contradict each other, and the existence of a modality-nonspecific multisensory format common to both has been identified while separate memory systems continue to function (Newell et al., 1979; Seemüller et al., 2011; Woods et al., 2004). The existence of modality-specific representations seems certain, in that in everyday life, visual and hand sensory experiences are formed in relation to different functional and neural mechanisms (Ciricugno et al., 2020). In addition, as the memory task in this study was difficult to retain verbally, it was necessary to accurately remember diagonal angles using both the eyes and hands. In the case of a task that requires recalling accurate spatial information using the eyes and hands, the representation generated by the eyes and that generated by the hands may have exhibited different qualities even though they share some information.

If two types of representations are formed through the eyes and hands, it is conceivable that interference effects may arise when the interference task engages both the eyes and hands instead of only the eyes or only the hands. However, memory performance in such a task might decrease simply because of the increased cognitive load of using both the eyes and hands; thus, an experimental design that avoids the problem of cognitive load is necessary. In this study, we examined the effects of interference using a single modality. To deepen our understanding of the disappearance of interference effects in Experiments 1B and 2B, it may be necessary to develop an experimental design that examines the interference effects of a combination of eyes and hands.

Finally, this study demonstrated that input-modality-specific and output-modality-specific representations exist in eye–hand and hand–eye cross-modal memory. The results also indicated that, in cross-modal memory, the modality-specific representational systems of input and output act independently and may function complementarily, depending on the situation. Future research should investigate the independence of and relation between input-modality-specific and output-modality-specific representational systems. In particular, it should be established whether these mechanisms are observable in task settings other than the experimental environment adopted here, and the extent to which these maintenance structures and coordination functions are usable should be investigated. Studies of spatial working memory, which focused on the retention of spatial location, were grounded in a situation that differs from the cross-modal setting: only the eyes were used during encoding, and the eyes and hands cooperated during testing. It remains unclear whether information is maintained using modality-specific or modality-nonspecific representations in situations involving eye–hand coordination. Memory settings that require coordination of this kind should also be examined further. Moreover, another issue to be resolved is the extent to which these mechanisms apply to other stimulus characteristics, such as spatial resolution and the dynamics of motion. These questions should be investigated in future studies.