1 Introduction

In typical human–human interactions, the interlocutors use not only speech and language but also a wide variety of paralinguistic means and nonverbal behaviors to signal their speaking intentions to the partner, to express intimacy, and to coordinate their conversation (Argyle et al. 1968; Beattie 1978, 1980; Clark 1996; Kendon 1967; Kleinke 1986; Mehrabian and Wiener 1967; Mehrabian and Ferris 1967). Recently, there has been growing interest in the automatic analysis of conversational data, particularly of multiparty conversations. Related studies have been launched within several communities, including those of human–computer interaction, machine learning, speech processing, and computer vision, with the aim of furthering our understanding of human–human communication and the multimodal signaling of social interactions (Gatica-Perez 2009; Pentland 2005; Vinciarelli et al. 2009). In multiparty conversations, such as a group of people chatting informally or attending a more formal meeting, coordination and interaction obviously cannot be managed in the same way as in a dialogue between two speakers who share the responsibility for coordination. Various multimodal corpora of multiparty conversations have been collected as fundamental research resources for developing automatic analysis technologies (Carletta et al. 2005; Garofolo et al. 2004).

Thanks to these resources and to advances in technology, it is now possible to study communicative behavior and social signaling patterns in multiparty conversations by applying automatic analysis techniques to such multimodal corpora. Speaker diarization (the process of partitioning an input audio stream into homogeneous segments according to speaker identity), when used together with speaker recognition (the identification of speakers by their voices), has become a key technology for tasks such as navigation, retrieval, and high-level inference from the audio data of meeting recordings. Some speaker diarization systems integrate motion and gaze data analysis with audio analysis to achieve higher accuracy and robustness (Anguera et al. 2012; Moattar and Homayounpour 2012). There are also meeting systems that use multimodal data including both motion and gaze (Hain et al. 2010; Tur et al. 2008). Besides speech recognition, motion capture and gesture recognition technology can be used for the automatic analysis of multimodal data, and developments in eye-tracker technology allow us to study gaze behavior in an objective manner. Many quantitative studies on human–human interaction have also reported that eye gaze plays an important role in monitoring conversation content and contributes to the performance of collaborative tasks that require an understanding of communication partners (Boyle et al. 1994; Clark and Krych 2004; Jokinen et al. 2013).

These findings on human–human interactions were mainly obtained from conversations held in the mother tongue (L1). However, people use foreign languages when they travel around the world for business or pleasure and when they chat over the Internet with people living in other countries. Second-language (L2) conversations are thus commonly observed in daily life, and the proficiency of conversational participants typically ranges from low to high. Such differences in proficiency can cause serious miscommunications and may disrupt collaboration by both native and non-native speakers in human–human communication (Beyene et al. 2009). Uneven proficiency in L2 may also lead to uneven opportunities for participation in conversations. A multiparty conversation consists of “ratified participants” (Goffman 1976), and participants with poorer proficiency might be relegated to “side participant” status regardless of their level of expertise in the tasks they are working on collaboratively. It is therefore an urgent issue to develop technologies for monitoring the understanding and the contributions of all participants and for supporting smooth interactions in L2 conversations. To achieve this aim, we need to extend automatic analysis techniques to human–human interactions in which the participants are conversing in L2.

However, few multimodal corpora of conversations involving L2 usage have been constructed for analyzing communicative behavior and social signaling patterns, which are assumed to differ from those in purely L1 conversations. Furthermore, there has been a near-total lack of multimodal corpora of L1 and L2 conversations by the same interlocutors that would allow a precise analysis of the differences between their L1 and L2 communicative behaviors.

Our previous studies of multiparty conversations suggested that gazing activities, one of the most important features for monitoring conversation content and understanding communication partners, differ between conversations in L1 and in L2: listeners’ gazing durations were longer in L2 conversations than in L1 conversations, although there was no such difference in the speakers’ gazing activity (Yamasaki et al. 2012; Kabashima et al. 2012). Although these studies suggested differences in communicative behaviors, such as utterances and eye gazes, between L2 and L1 conversations, the data size was insufficient to take into account the various factors expected to affect communicative behaviors, such as L2 expertise and conversational topic.

To investigate the differences in communicative activities by the same interlocutors in Japanese (their L1) and in English (their L2), an 8-hour multimodal corpus of multiparty conversations was collected. Three subjects at various levels of conversational expertise in L2 participated in each conversational group, and they had conversations on free-flowing and goal-oriented topics in L1 and L2. Their utterances, eye gazes, and gestures were recorded with three microphones, eye trackers, and video cameras. We collected a total of 80 conversations by 20 conversational groups. Quantitative analyses were conducted to compare differences in utterances and in eye gaze activities caused by the differences in conversational languages, the conversation types, and the participants’ levels of L2 language expertise.

This paper is structured as follows. We describe the multimodal data collected in this research in Sect. 2 and continue with the annotation and transcription of the data for corpus creation in Sect. 3. We report our analytical results on eye gaze and the characteristics of utterances in Sect. 4 and present a discussion in Sect. 5. Our conclusions and future work are given in Sect. 6.

2 Data collection

It has been shown in previous research that mutual gaze is important in the coordination of interaction (e.g., Kendon 1967; Argyle and Cook 1976), but these studies mainly dealt with two-party dialogues, not multiparty conversations. We collected multimodal data in three-party conversation to investigate whether the role of eye gaze is as important in three-party conversation as in two-party dialogue (Jokinen et al. 2013). Furthermore, we collected 80 multimodal datasets (20 free-flowing in Japanese, 20 free-flowing in English, 20 goal-oriented in Japanese, and 20 goal-oriented in English) from multiparty conversations to investigate the difference in eye gaze between L1 and L2 conversations. The same twenty groups of three participants each conversed in all four conditions. The average duration of a dataset was 6 min. Table 1 lists quantitative features of the collected data.

Table 1 Quantitative features of collected data

2.1 Experimental setup

Three subjects participated in each conversational group and sat in a triangular formation around a table. The distance between the three participants was 1.5 m (Fig. 1). Three sets of eye trackers, headsets with microphones, and video cameras were used, and the eye gazes, voices and gestures of all three participants were recorded (Fig. 2). A start signal triggered these instruments in synchronization with each other. Students of Doshisha University were recruited to participate in the experiment, and all sessions were conducted in the same lecture room at this university.

Fig. 1
figure 1

Seating positions for three participants during collection of multiparty conversation

Fig. 2
figure 2

Experimental setup

We used NAC EMR-9 (NAC Image Technology Inc.) eye trackers in this experiment. These cap-mounted eye-tracking systems enabled the participants to move their heads and hands freely in accordance with their conversational activities. Each eye tracker had one eyesight camera and two eye cameras as well as near-infrared sensors. The eyesight camera recorded the scene that the subject was gazing at; its angle of view was 62°. The eye cameras recorded the eye movements of each participant at a sampling rate of 60 fps. The near-infrared sensors recorded the participant’s pupils and the reflections from the corneas. Eye gazes were not tracked when participants blinked, laughed, or narrowed their eyes so much that the eye tracker could not detect their pupils.

2.2 Participants

A total of 60 subjects (20 groups) between the ages of 18 and 24 participated in this experiment as previously explained. They were Japanese university students who had acquired Japanese as their L1 and had learned English as their L2. They were not acquainted with each other before the meeting held for data collection. Their communication levels in English were measured using the Test of English for International Communication (TOEIC). Their scores ranged from 450 to 985 (990 being the highest attainable score). Figure 3 shows the cumulative distribution of their TOEIC scores in comparison with that of the latest TOEIC test administered nationwide (TOEIC_TEST). The two cumulative distributions have nearly the same shape, although in the lower score ranges the nationwide distribution is slightly higher than that of the participants.

Fig. 3
figure 3

Cumulative distribution of participants’ TOEIC scores
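For reference, the curve for the participants in Fig. 3 is an ordinary empirical cumulative distribution. The following Python sketch shows how such a curve can be computed and plotted; the score values in it are illustrative placeholders, not the actual data, and the nationwide curve (TOEIC_TEST) would be drawn in the same way from the published score table.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative placeholder scores; the actual participant scores ranged
# from 450 to 985 and are not reproduced here.
participant_scores = np.array([450, 520, 585, 640, 700, 755, 810, 870, 930, 985])

scores = np.sort(participant_scores)
cumulative_pct = np.arange(1, len(scores) + 1) / len(scores) * 100

plt.step(scores, cumulative_pct, where="post", label="Participants")
plt.xlabel("TOEIC score")
plt.ylabel("Cumulative percentage (%)")
plt.legend()
plt.show()
```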

The differences in eye gaze and utterances between L1 and L2 conversations may depend on the participants’ L2 proficiencies themselves or on the relative differences in L2 proficiency within a conversational group. We therefore recruited participants of various L2 proficiencies and assembled groups with various combinations of proficiency, such as groups in which all participants had high TOEIC scores, groups in which all had low scores, and groups with high, middle, and low scores.

2.3 Procedure

At the beginning of each experimental session, the conversational topics were explained to the participants. There were two conversation types. The first was free-flowing, natural chatting that covered various topics such as hobbies, weekend plans, studies, and travel. The second was goal-oriented, in which the participants collaboratively decided what to take with them on trips to uninhabited islands or mountains. The order of the conversation types and the order of the languages used in the conversations were randomized to cancel out order effects. After the experiment had been explained, the eye trackers were calibrated: the participants looked at nine points on a calibration board while the system measured each eye’s position and shape and the reflections of the infrared light. Calibration was done for the three participants in parallel, and the conversation started after all of them had finished. Each group had conversations on free-flowing and goal-oriented topics in Japanese and in English. The participants also filled out a questionnaire after each conversation. The questions covered features that characterize communication, such as the participants’ gazing activities, their feelings toward the other participants, their interest in the topic of conversation, their conversational skill in English, and their evaluation of the conversation content (Umata et al. 2013). Consequently, the subjects in each group participated in four conversations and filled out four questionnaires.

3 Corpus creation

3.1 Annotation features

For this analysis, we used the EUDICO Linguistic Annotator (ELAN) developed by the Max Planck Institute for Psycholinguistics, a linguistic annotation tool for creating text annotations for video and audio files. We performed the annotation according to the MUMIN annotation scheme (Allwood et al. 2007) used in our previous research for modeling turn-taking behaviors in L1 conversations (Jokinen et al. 2013). The annotation features and values adopted in that research are listed in Table 2. A preliminary test showed that the inter-coder agreement between the annotators, measured by Cohen’s kappa coefficient, was not high for some features in the L2 conversations: the coefficients were 0.55 for Dialogue Act, 0.14 for Head Movement, and 0.34 for Hand Movement. In the first stage of the research, we therefore limited annotation to the features reported to be important in monitoring conversations, namely utterances, eye gazes, and turn-taking activities (Jokinen et al. 2013); these features also maintained high agreement among the annotators. The authors manually determined the start and end times of each utterance by considering only the pause durations (a 500 ms threshold) between consecutive speech segments, not their content. The authors adopted the annotation feature TURN for turn-taking activities such as turn-give, turn-take, and turn-hold, assigned on the basis of the pause duration between consecutive utterances.
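The segmentation and turn-labeling rules described above are simple enough to express in a few lines of code. The following Python sketch is not the authors’ actual tooling; it assumes that each participant’s raw speech segments are available as (start, end) pairs in seconds and uses a simplified turn rule in which a speaker change yields a turn-give/turn-take pair and a continuation by the same speaker yields a turn-hold.

```python
PAUSE_THRESHOLD = 0.5  # 500 ms, as described above

def merge_segments(segments, pause=PAUSE_THRESHOLD):
    """Merge one speaker's consecutive speech segments into utterances when
    the pause between them is shorter than `pause` seconds (content ignored)."""
    utterances = []
    for start, end in sorted(segments):
        if utterances and start - utterances[-1][1] < pause:
            utterances[-1][1] = end          # extend the current utterance
        else:
            utterances.append([start, end])  # start a new utterance
    return [tuple(u) for u in utterances]

def label_turns(utterances):
    """Assign simplified TURN labels to time-ordered (speaker, start, end)
    utterances: 'turn-hold' when the same speaker continues, otherwise
    'turn-give' for the earlier utterance and 'turn-take' for the later one."""
    labels = [[] for _ in utterances]
    for i in range(len(utterances) - 1):
        if utterances[i][0] == utterances[i + 1][0]:
            labels[i].append("turn-hold")
        else:
            labels[i].append("turn-give")
            labels[i + 1].append("turn-take")
    return labels
```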

Table 2 Annotation features and values in previous research

3.2 Gaze events

As already mentioned, many quantitative studies have reported that eye gaze plays an important role in monitoring conversation content and contributes to the performance of collaborative tasks that require an understanding of communication partners. We therefore selected gaze events as one of the annotation features and manually annotated the feature GazeObject on the basis of the gaze path provided by the eye tracker to obtain a more precise annotation. Its values were “Gaze at the person to your right,” “Gaze at the person to your left,” “Gaze to other” (gaze at objects besides the person to your right or left), and “NoGaze” (gaze was not detected). A gaze event is defined as gazing at some object, that is, the participant focusing her visual attention on a particular object for a certain period of time (more than 200 ms). In the current study there are three named objects: the two partners and the “other.” Although focusing one’s visual attention on something can also include gaze shifts over the whole context, we concentrate on the local attention level and thus count each gaze shift as a separate gaze event. Small movements around the same gaze object are nevertheless included in one continuous gaze event. There are two reasons for such movements: saccades, the eyes’ constant involuntary shifts of their fixation points, and the participant’s own involuntary eye movements around the gaze object while generally focusing on it. Gaze shifts that remain within the outline of the current gaze object are therefore included in the same event; once the gaze shifts beyond that outline, a new gaze event begins. The gaze signal may also be broken, with no signal data, for technical reasons. If the participant’s visual attention has not changed within 0.2 s, the segments before and after such a gap are considered part of the same gaze event; otherwise they are treated as two different gaze events.

We first tried automatic analyses of the eye-tracking signals, but the low resolution of the eye tracker prevented reliable automatic detection of faces and bodies, so we decided to annotate the features manually to obtain more precise annotations. The authors therefore manually annotated the start time, end time, and value of each gaze event following the procedure described above: the start time of a gaze event is set to the time when the gaze shifts to within the outline of the gaze object, provided that the gaze does not shift beyond that outline again within 200 ms, and the end time is set to the time when the gaze shifts beyond the outline of the gaze object, provided that the gaze stays outside it for more than 200 ms. Figure 4 shows an annotation screenshot, with the video shown at the top and the annotations added to the rows at the bottom. Cohen’s kappa coefficient for segmenting gaze events was 0.83.
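Although the gaze events were annotated manually, the segmentation rules above are mechanical enough to sketch in code. The following Python fragment assumes that a per-frame gaze target label ('Right', 'Left', 'Other', or None when no signal was available) has already been assigned; only the 200 ms thresholds come from the text, and everything else is illustrative.

```python
FPS = 60        # eye-camera sampling rate
MIN_EVENT = 0.2 # minimum duration of a gaze event (200 ms)
MAX_GAP = 0.2   # missing-signal gaps shorter than this are bridged

def gaze_events(labels, fps=FPS):
    """Turn a per-frame sequence of gaze targets ('Right', 'Left', 'Other',
    or None when no signal) into (target, start, end) events in seconds."""
    # 1. Collapse the frame sequence into runs of identical labels.
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = (i + 1) / fps
        else:
            runs.append([label, i / fps, (i + 1) / fps])

    # 2. Bridge short signal drop-outs between runs with the same target.
    merged = []
    for run in runs:
        if run[0] is None and merged and run[2] - run[1] < MAX_GAP:
            continue                      # skip the gap; decide at the next run
        if merged and merged[-1][0] == run[0]:
            merged[-1][2] = run[2]        # same target resumes: extend the event
        else:
            merged.append(run)

    # 3. Keep only events that have a target and last at least 200 ms.
    return [(t, s, e) for t, s, e in merged
            if t is not None and e - s >= MIN_EVENT]
```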

Fig. 4
figure 4

Screenshot of ELAN and an enlargement of its annotation rows

3.3 Transcription

The limited vocabulary of non-native speakers forces them to express themselves with unsuitable words, and non-native speech usually includes less fluent pronunciation as well as mispronunciations, even when it is otherwise well composed. These characteristics made it difficult even for native speakers to transcribe the recorded speech correctly, although native speakers can usually understand non-native speech in live conversational situations.

After all the conversations had finished, the recorded voices in the L2 conversations were transcribed by the participants themselves and checked by a bilingual assistant. The transcription procedures were specified by the authors. For example, when the speaker was laughing or hesitating, the span had to be surrounded by exclamation points, as in !laugh!. Words had to be bounded by hash marks, as in #Tokyo# or #sushi#, when speakers uttered a proper noun or a Japanese word. We developed a tool for linking the annotated tags of utterances with their transcribed data. Figure 5 shows the interface of this tool and an example of a participant’s transcribed speech linked to the corresponding utterance tags.

Fig. 5
figure 5

Interface for linking annotated tags of utterances and their transcribed data, with an example of linked data. The columns denote the line number, the dialogue act value (not yet annotated), the participant ID, the start, end, and duration of each utterance, and the transcription

Based on the linked results, the transcribed utterances of each participant were time-aligned with those of the other participants so that the relationship between the content of utterances and the accompanying eye gaze movements can be analyzed for utterances with various dialogue acts.
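As an illustration of how the transcription markup and the linking step fit together, the following Python sketch parses the !...! and #...# conventions and attaches each transcribed line to the corresponding annotated utterance. It is a hypothetical reimplementation, not the tool shown in Fig. 5, and it assumes that transcripts are stored per participant in utterance order.

```python
import re

# Markup conventions described above: !...! marks laughing or hesitation,
# #...# marks proper nouns or Japanese words.
NONVERBAL = re.compile(r"!(\w+)!")
MARKED_WORD = re.compile(r"#([^#]+)#")

def parse_transcript_line(text):
    """Split a transcribed utterance into plain text, non-verbal events,
    and specially marked words."""
    nonverbal = NONVERBAL.findall(text)
    marked = MARKED_WORD.findall(text)
    plain = MARKED_WORD.sub(r"\1", NONVERBAL.sub("", text))
    return {"text": " ".join(plain.split()),
            "nonverbal": nonverbal,
            "marked_words": marked}

def link_transcripts(utterances, transcripts):
    """Attach each participant's transcribed lines, in order, to that
    participant's annotated utterances.

    utterances:  list of dicts with 'participant', 'start', 'end'.
    transcripts: dict mapping participant ID to a list of transcribed lines;
                 the list is assumed to match the utterance count.
    """
    counters = {pid: 0 for pid in transcripts}
    linked = []
    for utt in sorted(utterances, key=lambda u: u["start"]):
        pid = utt["participant"]
        line = transcripts[pid][counters[pid]]
        counters[pid] += 1
        linked.append({**utt, **parse_transcript_line(line)})
    return linked
```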

4 Features of conversations

In the following, we describe some features of utterances and eye gazes that were obtained using the conversational data.

4.1 Utterances

Some studies have regarded “nativeness” as “expertise” and compared the grounding process between differing levels of language expertise (Kasper 2004; Hosoda 2006). Based on these ideas, we expected the participants’ linguistic expertise in L2 to vary and, moreover, the difficulty of L2 conversation to be greater for participants with lower L2 expertise, while assuming that there was little difference in their linguistic expertise in L1. We also assumed that their linguistic expertise in L2 could be measured by their TOEIC scores and, within each conversational group, labeled the participant with the highest TOEIC score Rank1, the one with the second-highest score Rank2, and the one with the third-highest score Rank3. These rankings were used to analyze the effect of linguistic expertise on the difficulty of L2 conversation.

We predicted that the participants would speak more in the L1 conversations. As indices of the participants’ communication difficulties in L2, we compared (i) the percentage of silence in the conversation, (ii) the number of utterances, and (iii) the total and average utterance durations between the L1 and L2 conversations. We also analyzed the effects of conversation type and participants’ L2 expertise on the difficulty of communication in L2.
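These indices can be derived directly from the utterance annotations. The following Python sketch assumes that the utterances of a conversation are available as (speaker, start, end) tuples in seconds; the function and variable names are ours.

```python
def utterance_statistics(utterances, conversation_duration):
    """Compute the indices compared above from (speaker, start, end) tuples."""
    durations = [end - start for _, start, end in utterances]
    total = sum(durations)                 # total utterance duration (TUD)
    stats = {
        "num_utterances": len(durations),
        "total_utterance_duration": total,
        # AUD per conversation; filter by speaker first for a per-participant AUD.
        "average_utterance_duration": total / len(durations) if durations else 0.0,
    }
    # Silence = spans where nobody is speaking; take the union of all utterances.
    speech = 0.0
    current_end = 0.0
    for s, e in sorted((s, e) for _, s, e in utterances):
        if s > current_end:
            speech += e - s
            current_end = e
        elif e > current_end:
            speech += e - current_end
            current_end = e
    stats["silence_percentage"] = (1 - speech / conversation_duration) * 100
    return stats
```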

Table 3 lists basic statistics of the percentage of silence duration, the total utterance duration (TUD), the average utterance duration (AUD), and the number of utterances in four kinds of conversations, i.e., those on goal-oriented topics in L1 and L2 and free-flowing ones in L1 and L2. Figures 6, 7, and 8 show the total and average utterance durations and the number of utterances of the Rank1, Rank2, and Rank3 participants.

Table 3 Basic statistics of utterances
Fig. 6
figure 6

Total utterance durations of Rank1, Rank2, and Rank3 participants

Fig. 7
figure 7

Average utterance durations of Rank1, Rank2, and Rank3 participants

Fig. 8
figure 8

Number of utterances of Rank1, Rank2, and Rank3 participants

4.2 Analyses of utterances

The percentage of silence duration was greatest in the goal-oriented conversations in L2 and smallest in the free-flowing conversations in L1. Under the hypotheses that the percentage of silence was greater in L2 conversations than in L1 conversations and that the conversation type affected the percentage of silence, we conducted an ANOVA test on the percentage of silence duration, with both the language difference and the conversation type difference as within-subject factors. The results revealed significant main effects of both language difference (F(1, 19) = 125.6, p < .01) and conversation type difference (F(1, 19) = 37.8, p < .01), and no interaction was observed. This result shows that the percentage of silence is greater in L2 conversations than in L1 conversations and greater in goal-oriented conversations than in free-flowing ones. The analysis failed to show that the language effect was stronger in goal-oriented conversations than in free-flowing ones, probably because of the large difference in silence duration between goal-oriented and free-flowing conversations.
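For readers who wish to reproduce this kind of analysis, the silence comparison is a 2 × 2 repeated-measures design (language × conversation type, both within groups). A minimal sketch using the pingouin package is shown below; the long-format table, its column names, and the file name are assumptions on our part, and the later analyses that add L2 expertise as a between-subject factor would require a mixed-design ANOVA instead.

```python
import pandas as pd
import pingouin as pg

# Long-format data: one row per conversational group and condition.
# Assumed columns: group, language ('L1'/'L2'), conv_type ('free'/'goal'),
# silence_pct.
df = pd.read_csv("silence_percentages.csv")

aov = pg.rm_anova(data=df, dv="silence_pct",
                  within=["language", "conv_type"],
                  subject="group", detailed=True)
print(aov)  # main effects of language and conv_type, and their interaction
```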

We expected both the total utterance duration (TUD) and the average utterance duration (AUD) to be longer in L1 than in L2 and, moreover, the conversation type and L2 expertise to affect these features of utterances. Under these hypotheses, we conducted an ANOVA test for TUD and AUD, with both the language difference and the conversation type difference as within-subject factors and L2 expertise as the between-subject factor. For TUD, the results revealed a significant main effect of language difference (F(1, 57) = 95.0, p < .01) and a significant main effect of conversation type difference (F(1, 57) = 10.8, p < .01). There was also a significant interaction between L2 expertise and the language difference (F(2, 57) = 8.0, p < .01). These results show that TUD is longer in L1 conversations than in L2 conversations and longer in free-flowing conversations than in goal-oriented ones. The interaction between L2 expertise and the language difference shows that the decrease in TUD from L1 to L2 conversations depends on the expertise of the participants and that the decrease for Rank3 participants was larger than for the others, as shown in Fig. 6.

For AUD, the results revealed a significant main effect of language difference (F(1, 57) = 31.5, p < .01) and a significant main effect of conversation type difference (F(1, 57) = 6.8, p < .05). This shows that AUD is longer in L1 conversations than in L2 conversations and longer in free-flowing conversations than in goal-oriented ones. Unlike TUD, however, AUD showed no significant interaction between L2 expertise and the language difference, although there was a second-order interaction (F(2, 57) = 4.6, p < .05) among L2 expertise, the language difference, and the conversation type difference. This suggests that the differences in AUD among speakers of the three Ranks are less pronounced than those in TUD.

Under the hypotheses that the number of utterances was greater in L2 conversations than in L1 conversations and that the conversation type affected the number of utterances, we conducted an ANOVA test with the language difference and the conversation type difference as within-subject factors and L2 expertise as the between-subject factor. The results revealed a significant main effect of language difference (F(1, 57) = 37.1, p < .01), and there was a significant interaction between L2 expertise and the language difference (F(2, 57) = 7.4, p < .01). These results show that the number of utterances is larger in L1 conversations than in L2 conversations, whereas there is no comparable difference between free-flowing and goal-oriented conversations. The interaction between L2 expertise and the language difference suggests that the decrease in TUD from L1 to L2 conversations was mainly due to the decrease in the number of utterances by Rank3 participants.

Table 4 lists the ANOVA test results for the percentage of silence duration, TUD, AUD, and the number of utterances.

Table 4 Features of ANOVA test results on utterances

4.3 Gaze events in speaking

As already mentioned, many quantitative studies on human–human interaction have reported that eye gaze plays an important role in coordinating conversations and contributes to the performance of collaborative tasks that require an understanding of communication partners. Previous research (Yamasaki et al. 2012; Kabashima et al. 2012) has suggested that eye gaze during speaking differs according to whether a conversation is conducted in L1 or L2, which implies that the function of eye gaze in L2 conversations might differ from that in L1. We therefore analyzed the eye gazes of both speakers and listeners in the L2 conversations in comparison with the L1 conversations. More specifically, we analyzed (1) how long the speaker was gazed at by the other participants, (2) how long the speaker gazed at the other participants, and (3) whether the expertise of the participants affected their gazing activities.

In our previous research (Yamamoto et al. 2013), we used the average gazing-at ratio to analyze how long the speaker gazed at the other participants and the average being-gazed-at ratio to analyze how long the speaker was gazed at by the other participants. In this paper we use the speaker’s gazing ratio and the listener’s gazing ratio to analyze these activities more precisely. The speaker’s gazing ratio indicates how long the speaker gazed at the other participants during his/her utterances and is defined as the ratio of the duration of the speaker gazing at the other participants to his/her speaking duration. The listener’s gazing ratio indicates how long a participant gazed at the speaker during the speaker’s utterance and is defined as the ratio of the duration of that participant gazing at the speaker to the speaking duration. Figure 9 illustrates the concepts of the speaker’s gazing ratio and the listener’s gazing ratio. The being-gazed-at ratio aggregates the gazing activities of both listeners, whereas the listener’s gazing ratio captures the gazing activity of each listener independently. The speaker’s gazing ratio is the same as the gazing-at ratio; we rename it here to clarify the relation between the speaker’s and the listener’s gazing activities. The average of the speaker’s gazing ratios is defined as

$$ \text{Average of speaker's gazing ratios} = \frac{1}{n}\sum_{i=1}^{n} \frac{DSG_{j}(i)}{D(i)} \times 100\,(\%). $$

Here, $D(i)$ is the duration of the $i$th utterance and $DSG_{j}(i)$ is the duration of the speaker gazing at the $j$th participant ($j$ = 1, 2, 3) during the $i$th utterance.

Fig. 9
figure 9

Flow diagram for calculating speaker's and listener’s gazing ratios

The average of the listener’s gazing ratios is defined as

$$ \text{Average of listener's gazing ratios} = \frac{1}{n}\sum_{i=1}^{n} \frac{DLG_{j}(i)}{D(i)} \times 100\,(\%). $$

Here, $DLG_{j}(i)$ is the total duration of the $j$th participant ($j$ = 1, 2, 3) gazing at the speaker during the $i$th utterance.
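Both averages can be computed with the same routine, depending on whose gaze events are passed in. The following Python sketch assumes that utterances and gaze events are available as (start, end) intervals in seconds; the function names are ours.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def average_gazing_ratio(utterances, gaze_events):
    """Average of DSG_j(i)/D(i) (or DLG_j(i)/D(i)) over utterances i, in percent.

    utterances:  (start, end) intervals of one speaker's utterances.
    gaze_events: (start, end) intervals during which the person of interest
                 (the speaker or a listener) gazes at the relevant partner.
    """
    ratios = []
    for u_start, u_end in utterances:
        gazed = sum(overlap(u_start, u_end, g_start, g_end)
                    for g_start, g_end in gaze_events)
        ratios.append(gazed / (u_end - u_start) * 100)
    return sum(ratios) / len(ratios) if ratios else 0.0
```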

To evaluate whether the participants’ L2 expertise affects their gazing activities, we calculated the averages of the speaker’s gazing ratios and of the listener’s gazing ratios at speakers of Rank1, Rank2, and Rank3. To evaluate whether the differences in L2 expertise among the participants affect their gazing activities, we also calculated the average gazing ratios of participants of lower expertise toward participants of higher expertise (Rank2 toward Rank1, Rank3 toward Rank1, and Rank3 toward Rank2) and, likewise, the average gazing ratios of participants of higher expertise toward participants of lower expertise (Rank1 toward Rank2, Rank1 toward Rank3, and Rank2 toward Rank3).

4.4 Analyses of gaze events in speaking

Table 5 lists basic statistics of the number of gaze events and of the speaker’s and listener’s gazing ratios. The statistics show that the number of gaze events in the four conversational conditions follows almost the same tendency as the number of utterances in Table 3: the larger the number of utterances, the larger the number of gaze events. This suggests a relation between utterances and gaze events.

Table 5 Basic statistics of eye gaze activities

As for the gazing ratios, the averages of the speaker’s gazing ratios seemed to be almost the same in the four kinds of conversations. On the other hand, the averages of the listener’s gazing ratios were larger in L2 conversations than in L1 conversations, and they were also slightly larger in the free-flowing conversations than in the goal-oriented ones. Figure 10a compares the averages of the listener’s gazing ratios at a speaker of each Rank among the four kinds of conversations, i.e., the free-flowing conversations and the conversations on goal-oriented topics in both L1 and L2. Figure 10b, c compare the averages of the listener’s gazing ratios of participants of higher expertise toward participants of lower expertise, and vice versa, among the four kinds of conversations.

Fig. 10
figure 10

Averages of listeners’ gazing ratios for various listeners in four kinds of conversations. a Average of listeners’ gazing ratios at speaker of each Rank. b Average of listeners’ gazing ratios of participants of higher expertise to participants of lower expertise (higher-listener vs. lower-speaker). c Average of listeners’ gazing ratios of participants of lower expertise to participants of higher expertise (lower-listener vs. higher-speaker)

Under the hypotheses that speakers were gazed at more in L2 conversations than in L1 conversations and that the conversation type and the participants’ L2 expertise affect gazing activities, we conducted an ANOVA test on the listener’s gazing ratios with the language difference and the conversation type difference as within-subject factors and the expertise of the speaker as the between-subject factor. The results revealed a significant main effect of language difference (F(1, 117) = 107.7, p < .01) and a significant main effect of conversation type difference (F(1, 117) = 6.0, p < .05). There was no significant interaction between the language difference and the L2 expertise of the speakers.

This result shows that a speaker is gazed at more by the listeners in L2 conversations than in L1 conversations, and more in the free-flowing conversations than in the goal-oriented ones in both languages.

We also conducted an ANOVA test with the language difference and the conversation type difference as within-subject factors and the direction of gaze with respect to expertise, i.e., whether the listener’s gazing ratios were those of participants of higher expertise toward participants of lower expertise or those of participants of lower expertise toward participants of higher expertise, as the between-subject factor. The results revealed a significant main effect of language difference (F(1, 117) = 107.4, p < .01) and a significant main effect of conversation type difference (F(1, 117) = 6.2, p < .05). There was no significant interaction between the language difference and this expertise-related factor.

As for the average of the speaker’s gazing ratios, we found no significant effect of either the language difference or the conversation type difference.

Table 6 lists the ANOVA test results of eye gaze activities.

Table 6 Features of ANOVA test results on eye gaze activities

5 Discussion

We have conducted quantitative analyses on utterance and eye gaze using the multimodal corpus of multiparty conversations in L1 and L2. The main points obtained through the analyses are as follows.

5.1 Differences in utterances

The results in Sect. 4.2 indicated that there were significant differences in silence duration, total and average utterance durations, and the number of utterances between the L1 and L2 conversations. The shorter total and average utterance durations and the longer silences in L2 reflect the difficulties the participants had in the L2 conversations. These results suggest that the participants produced shorter, presumably simpler, utterances in the L2 conversations; however, a content analysis should be conducted to confirm this assumption.

There were also significant differences in silence duration, TUD, and AUD between free-flowing and goal-oriented conversations in both L1 and L2. This result suggests that the difficulty of speech production was more serious in goal-oriented conversations than in free-flowing ones, perhaps because participants need more time to produce speech that fits the discourse of a goal-oriented conversation. We expected that this difficulty would be more serious in goal-oriented conversations in L2 than in L1 owing to a shortage of vocabulary and colloquial expressions for the topic; however, the analysis did not show such an interaction. This may be because the effect of the language difference was much stronger than that of the conversation type difference.

For TUD and the number of utterances there was a significant interaction between L2 expertise and the language difference; however, we could not find a significant interaction between L2 expertise and the language difference for AUD. The interaction results show that the difference between TUD in L1 and in L2 conversations was mainly due to the decrease in the number of utterances by Rank3 participants.

The decrease in AUD from L1 to L2 conversations was proportionally smaller for Rank2 participants than for Rank1 and Rank3 participants. We think that this phenomenon might be a kind of “alignment” (Garrod and Pickering 2004) and that participants of the middle-expertise group (Rank2) play a role in mediating the conversation between participants of high and low expertise. We will investigate this phenomenon using the transcribed speech data and the dialogue act annotation used for grounding.

5.2 Differences in eye gazes

The results in Sect. 4.4 indicated that eye gaze behavior in the L2 conversations differed from that in L1. The speakers were gazed at more by their listeners in the L2 conversations than in the L1 conversations, in both the free-flowing and the goal-oriented conversations. Several possible reasons could explain this difference between the gaze activities in the two languages, among them: (1) participants monitored their understanding of what was being said in order to make repairs if necessary, (2) participants used visual information to help them perceive the auditory information, and (3) participants gave a polite acknowledgement of the speaker’s effort in producing speech with difficulty. We should investigate the cause of this phenomenon by analyzing the relation between gaze activities and speech acts.

5.3 Effect of expertise in L2

The effect of different levels of L2 expertise was tested for features of utterances as well as for eye gaze behavior. Concerning features of utterances, we found significant interactions between the language difference and L2 expertise. As for the average of the listener’s gazing ratios, we could not find a significant interaction between the language difference and L2 expertise.

To investigate this issue further, we conducted ANOVA tests to evaluate the effect of the volume of data on the interaction between the language difference and L2 expertise, starting with half of the data (10 sets) and incrementally adding two sets at a time. Figure 11 depicts the significance probabilities of the interaction between the language difference and L2 expertise, calculated using both the expertise of the speaker (Gaze_at_Rank) and the gaze directions from participants of higher expertise to participants of lower expertise and vice versa (Gaze_of_H2L, L2H). The datasets were numbered according to the order of data collection. As shown in Fig. 11, the significance probabilities of the interaction decreased as the volume of data increased and fell below 0.05 (the significance level) in both cases; however, they increased again in both cases as more datasets were added.
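A sketch of this incremental analysis is given below. It is a simplification that averages over conversation type and tests only the language × expertise interaction with a mixed-design ANOVA (pingouin); the table layout, column names, and file name are assumptions, and the groups are taken to be numbered in the order of data collection.

```python
import pandas as pd
import pingouin as pg

# Assumed long-format data averaged over conversation type: one row per
# participant and language, with columns group (1..20), participant,
# expertise ('Rank1'/'Rank2'/'Rank3'), language ('L1'/'L2'),
# listener_gazing_ratio.
df = pd.read_csv("listener_gazing_ratios.csv")

p_values = {}
for n_groups in range(10, 21, 2):          # 10, 12, ..., 20 datasets
    subset = df[df["group"] <= n_groups]   # groups numbered in collection order
    aov = pg.mixed_anova(data=subset, dv="listener_gazing_ratio",
                         within="language", subject="participant",
                         between="expertise")
    interaction_p = aov.loc[aov["Source"] == "Interaction", "p-unc"].iloc[0]
    p_values[n_groups] = interaction_p

print(p_values)  # how the interaction p-value changes as data are added
```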

Fig. 11
figure 11

Probabilities of interactions between language difference and expertise in L2 for listener’s gazing ratios at each Rank (Gaze_at_Rank) and those from high to low and low to high (Gaze_of_H2L, L2H)

These results suggest that the volume of the multimodal corpus is still insufficient for analyzing the interaction between the language difference and L2 expertise in listeners’ gazing activities, implying that such an interaction might nevertheless exist (see Appendix 2).

In this paper we analyzed features of utterances and eye gaze in the collected multimodal data only from a statistical viewpoint and did not conduct a detailed analysis of the relation between eye gaze and utterances, as has been done for L1 conversations by Goodwin (1979). Our transcription showed that the participants tended to produce not only simpler expressions but also imperfect or fragmentary ones more often in the English conversations than in the Japanese ones, probably because of their limited language proficiency. We plan to analyze the differences in disfluencies between L1 and L2 conversations and their relation to eye gaze behaviors, although this requires a major additional annotation effort.

6 Conclusion

We collected an 8-hour multimodal corpus of multiparty conversations to investigate the differences in the communicative activities of the same interlocutors in Japanese (their L1) and in English (their L2). Although the annotation of speech acts and the alignment of the transcribed speech data are still in progress, we found some interesting features of utterances and eye gaze by analyzing the conversational data for which the annotations of speech, gaze events, and turn-taking activity were completed.

We confirmed that the total and average utterance durations were shorter and the silences longer in the L2 conversations than in the L1 conversations, which reflects the difficulties experienced by the participants in L2. We also confirmed that eye gaze behavior in the L2 conversations differed from that in L1: speakers were gazed at more by listeners in L2 conversations than in L1 conversations. The reason for this is still not clear, and analyses of the relations between gaze activities and speech acts seem necessary to clarify this point.

The results on the total utterance durations and the numbers of utterances revealed significant interactions between L2 expertise and the language difference. As for eye gaze, we could not find a significant interaction between the language difference and L2 expertise, but the results obtained using subsets of the data suggested that such an interaction might exist.

Considering that second-language conversations are commonly observed in daily life throughout the world, our findings on differences in conversations in L1 and L2 suggest possible directions for future research in psychology and cognitive science as well as in human–computer interaction technologies. This study provided a basis for monitoring the status of all participants by studying the effects of their linguistic proficiency on communicative activities and attitudes in second-language conversations. These results can be used to support mutual understanding and balanced participant contributions, in cases of uneven linguistic proficiencies, in cooperative activities involving computer-supported cooperative work (CSCW). The results may also be important in developing humanoid robots or agents for dialogue-based computer-assisted language learning.

We plan to collect more multimodal data and continue annotation and transcription of multimodal corpora to obtain more reliable results on the differences between participants of different expertise in L2. Moreover, we intend to make the multimodal corpora available to the research community after completing the annotation work.