1 Introduction

In the modern era, with the demand for several specialized courses, universities increasingly depend upon live distance learning platforms to receive expert lectures. However, studies show that students still prefer the conventional classroom mode of teaching, where the teacher is physically present. Distant students often feel an experiential divide compared to the students present at the teacher's location [1]. Several factors contribute to this divide; the inability to have a gaze-aligned interaction is one of the major causes [14].

In this research, we primarily focus on wholly-online → collaborative → synchronous learning [1]. Each student cluster is assumed to be present in what we call an extended classroom [20]. A classroom is converted into an extended classroom by introducing the distant participants, be it the teacher or the students of a distant classroom collectively, through live video and audio feeds, i.e., by means of video conferencing. A lecture session thus consists of several extended classrooms connected through a high-speed network, with several audio and video feeds streamed in coherence.

Such a system, at a basic level, inherently lacks gaze alignment, which adversely impacts both teaching and interactions [22]. For instance, when the teacher points at an object (a teaching aid such as a board, a demo model, etc.) in the classroom, remote students often lose the perception of where the teacher is pointing until they identify it through contextual means. This disrupts the smooth flow of attention to the content of the lecture. Another example is when the teacher points at a student intending to start an interaction: both the student being pointed at and the other students are at a loss in identifying who the teacher is pointing to. Preserving directionalities thus becomes vital for the smooth identification of the “entity-at-focus” at any point of time. This reduces the cognitive load in following the lecture as well as in establishing vicarious interactions.

Existing systems focus on preserving eye contact using several methods (discussed in detail in Section 2), while subtler levels of gaze alignment are often ignored. The following three levels of gaze alignment are addressed in this paper.

  1. Mutual gaze, which simply refers to eye contact between the interacting participants [18].

  2. Gaze awareness, which in this context means knowing where others are looking. “The listeners are able to perceive other listeners look at the speaker” illustrates the effect of gaze awareness [18].

  3. Gaze following, which reflects an “expectation-based type of orienting in which an individual’s attention is cued by another’s head turn or eye turn” [8]. The attention of a listener can be redirected onto another entity by a gesture (finger point, head turn or eye turn); this illustrates gaze following.

We observe the above levels of gaze alignment in a conventional classroom during normal interactions. For instance, when lecturing, the teacher gazes at all participants and the teaching aid, such as a whiteboard, while all the students look at the teacher. When the teacher interacts with a student, the teacher and the interacting student maintain eye contact (mutual gaze), while all the other participants gaze at either the teacher or the student, depending on who is speaking at that point in time (gaze awareness). When the speaker (teacher/student) points at an object, the gaze directions of all the listeners are redirected from the speaker to that object (gaze following). The above examples illustrate the interaction dependency of gaze.

Furthermore, in addition to interaction dependency, gaze directionalities also depend on the relative positioning (position dependency) of the different entities across the various classrooms. In extended classrooms, the relative positions of participants across the different locations need not be consistent, and the placement of the multimedia capture and display devices around the participants affects the perceived gaze directions as well. In short, a typical extended classroom setup fails to account for one or more levels of gaze (detailed description in Section 3). Solutions hitherto mainly focus on preserving mutual gaze, which alone is inadequate for ensuring an effective gaze-aligned interaction in which the participants can instinctively identify the entity-at-focus with ease.

We address this by introducing a set of cameras and displays around the participants in the classroom at each location in a multi-location setting. The displays present videos of the remote participants to the physical participants of that location. During an interaction, appropriate camera feeds (videos) are streamed to remote displays such that a correct perspective of the remote participant is shown to the physical participant, i.e., which camera in classroom i maps to which display in classroom j. Thus, several video streams are exchanged between the classrooms. During each interaction, this camera-display mapping changes and is recalculated.

Our contributions in this paper are:

  1. developing a gaze analysis framework for the classroom environment;

  2. presenting a dynamic gaze correction n-classroom architecture that provides both the teacher and the different groups of students an efficient multiparty classroom-interaction platform;

  3. evaluating the impact on classroom interactions using a three-classroom test-bed.

The gaze framework consists of 1) position and orientation specification across multiple distant classrooms, 2) vector representation of gaze directions, 3) analysis of gaze behaviour during interactions and 4) modelling of distortions during capture and display. The formalism dictates a placement strategy ensuring optimal resource usage; we thus present a scalable, dynamic gaze correction architecture for an n-classroom setup ensuring conformity to all three levels of gaze. We also present an implementation of a three-classroom testbed for evaluation. We conducted both subjective and objective evaluations, over both short and long durations, of an interaction-intensive lecture. The evaluators were asked to identify the “entity-at-focus” at a given time, initially without and then with gaze alignment. A marked reduction in the time taken to ascertain the entity-at-focus indicates the ease and fluidity a gaze-aligned setup brings to interactions in an eLearning platform. The subjective evaluation results also concur with these empirical findings.

As a result of the implementation techniques used, the classroom experience is transformed from the gaze-insensitive experience shown in Fig. 1a to the gaze-aligned experience shown in Fig. 1b.

Fig. 1: Gaze directions in a representative remote classroom (classroom 1), with videos of the teacher and students in classrooms 2 and 3 shown on displays. a An interaction in which remote students 3 are talking to the teacher, but gaze is not aligned. b The same interaction with gaze alignment. Arrows indicate the perceived gaze directions of participants

The paper is organized as follows.

  • Section 2: Related Work - describes prior art.

  • Section 3: Problem Description - describes how the three levels of gaze alignment fail.

  • Section 4: Gaze Alignment Framework & n-classroom Architecture - describes the framework for analyzing gaze dependencies; an n-classroom architecture for an optimal placement of media devices is also discussed.

  • Section 5: Experimental Setup - describes our experimental setup for a three-classroom gaze alignment system.

  • Section 6: Evaluation - describes the effectiveness of our system in enhancing interactivity in a distance learning setting.

  • Section 7: Conclusion

2 Related work

2.1 Importance of gaze in social interactions

A substantial body of prior work focuses on the importance of social interactions in a classroom environment. Work by Owens and Hardcastle [22] shows that remote students do not experience the same feeling of inclusion in the classroom as locally present students (students present in the same location as the teacher); they often feel a sense of isolation when it comes to interaction with the teacher or with their peers in the remote classroom. Work by Jung et al. [15] shows that remote students’ learning experience is significantly enhanced through social interaction with peers rather than with the teacher alone. Learners experience a “sense of community”, enjoy mutual interdependence, build a “sense of trust” and have “shared goals and values” when having face-to-face interactions in a conventional classroom setup [7].

Monk and Gale [18], in the work “a look is worth 1000 words: Full gaze awareness in video-mediated conversations”, describe the concept of full gaze awareness in video-mediated interactions, including “mutual gaze” and “gaze awareness” as key aspects of gaze alignment. Work by Flom [8] describes gaze following as another major factor, enabling the speaker to direct the gaze of the listeners and shift the focus from himself/herself onto another entity. MacPherson and Moore [16] describe attention control by gaze cues, discussing the effects of gaze following, its development and its significance. Cues such as head movement, finger pointing, etc. are used to direct the attention of listeners towards the direction associated with the cue.

Gaze directions are correlated with attention [27]. Sharma et al. used computer vision techniques to analyze the head directions of students and conclude that a decrease in head motion correlates with a diminished attention span in the classroom environment.

2.2 Existing gaze correction systems

Several gaze correction solutions today try to enable eye contact between interacting parties. Most of these solutions induce a corrective warp on the video of a person’s face so that the offset angle between the camera and his/her gaze direction is seemingly nullified.

For instance, the work of Baek and Ho [2] requires a high-definition setup of a display and a set of four cameras, two mounted on top and two at the bottom of the display; the system interpolates between the viewpoints obtained by the top and bottom cameras to generate a morphed face that renders a mutual-gaze-aligned view to the remote viewer. Work by Ford D.A. and Silbermam [9] involves extracting a user’s facial features, such as the eyes, and substituting a corrected version of the feature to render a gaze-corrected perspective of an individual participant. Work by Ruigang Yang and Zhengyou Zhang [32] uses graphics hardware that synthesizes eye contact based on stereo analysis combined with rich domain knowledge. However, these systems often correct gaze misalignment only over small angles. Similarly, work by Jason Jerald and Mike Daily [13] involves real-time tracking and warping of the eyes, applying machine learning algorithms to each frame to give a feeling of natural eye contact between the interacting participants. Such solutions are applicable only in two-party video conferencing. A similar approach is seen in the work of several others.

There are also several avatar/robot-aided gaze coherent environments for multiparty video conferencing systems [19, 25, 29].

These, however, exhibit several limitations when extended to an eLearning scenario. In a large and dynamic setting such as a classroom, correcting an individual’s gaze using face morphing techniques is often impractical.

Cisco’s telepresence [12] provides a multiparty tele-conferencing solution that aims to preserve gaze directions by conserving the relative positions of participants across the different participating locations. Each station is equipped with a plurality of cameras and displays positioned around the participants, mimicking a round-table conference meeting. However, this is a rigid solution, as conserving the users’ relative locations across geographies becomes impractical in an eLearning scenario. In eLearning scenarios, students benefit more from being in the center of the classroom [3]. Thus, the participants at each location occupy the center of the physical space before them, while the remote students are displayed on the sides around the physical participants. This causes gaze discrepancies.

Other systems use augmented reality and 3D reconstruction tools to extract a 3D contour of a remote person and reconstruct it to present a lifelike video to the viewer [4, 23]. These systems require participants to wear head-mounted AR displays, and 3D contour extraction and reconstruction is computationally intensive. It is difficult to adapt such technologies to an extended classroom scenario.

Work by Bijlani et al. describes a technology enabling the exchange of multiple streams to build a collaborative tool; the basic architecture of this system was employed as the platform for our gaze alignment system [6]. Feedback mechanisms to enhance interaction were also adopted from such systems [28].

Our system involves tracking interaction changes. An Xbox Kinect 360 with the OpenNI2 and NiTE2 libraries was used to track the skeletal joints of the teacher. The arm joints were used to compute angular displacements and thereby the pose of the arm. A skeleton-tracked sample pose of a teacher pointing at a display to initiate switching is shown in Fig. 2 [24].
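To make this concrete, the following minimal sketch shows how a pointing pose could be resolved to a display from tracked joint positions; the joint coordinates, display positions and the angular tolerance are illustrative assumptions, not the exact values or library calls used in our implementation.

```python
import numpy as np

def pointing_direction(shoulder, hand):
    """Unit vector of the arm in the classroom's x-z plane (y ignored)."""
    v = np.array([hand[0] - shoulder[0], hand[2] - shoulder[2]], dtype=float)
    return v / np.linalg.norm(v)

def pointed_display(shoulder, hand, display_positions, max_angle_deg=15.0):
    """Return the display whose direction best matches the arm direction,
    or None if no display lies within the angular tolerance."""
    arm = pointing_direction(shoulder, hand)
    best, best_angle = None, max_angle_deg
    for name, pos in display_positions.items():
        to_disp = np.array([pos[0] - shoulder[0], pos[2] - shoulder[2]], dtype=float)
        to_disp /= np.linalg.norm(to_disp)
        angle = np.degrees(np.arccos(np.clip(np.dot(arm, to_disp), -1.0, 1.0)))
        if angle < best_angle:
            best, best_angle = name, angle
    return best

# Hypothetical joint and display coordinates (x, y, z) in metres.
shoulder = (0.0, 1.4, 0.5)
hand = (0.6, 1.4, 1.1)
displays = {"D_S_EC-1": (3.0, 1.2, 3.0), "D_S_EC-2": (-3.0, 1.2, 3.0)}
print(pointed_display(shoulder, hand, displays))   # -> 'D_S_EC-1'
```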

Fig. 2: a Teacher’s skeleton-tracked image. b Corresponding color frame of the skeleton-tracked teacher

Microphone activity was monitored to discern who the speaker is during an interaction. Work by Guntha et al. [11] describes methods to cancel echo in multi-party conferencing systems; a successful implementation of the same was also carried out. Since our system involves exchanging numerous video streams, several optimization techniques were implemented, such as priority tagging and throttling the video quality of streams [30]. Other optimization techniques were also employed, as in the work of Bijlani and Rangan [5].

Most of the above-mentioned methods describe a controlled video-conferencing scenario enabling only a handful of participants to have a gaze-aligned interaction. To bring about gaze alignment in live extended classroom teleteaching systems, it is necessary to study the nature of gaze during interactions and to model gaze using vectors.

3 Problem description

In this section, we take a closer look at the problem of gaze misalignment by comparing gaze directionalities in three commonly occurring interaction scenarios in a conventional classroom, and adapting the same to a typical extended classroom setup. For the purpose of illustration, we describe a minimal setup of three participating locations with three sets of students and a teacher, collectively forming four participant entities.

  • Conventional Classroom: consists of student sets S1, S2 and S3 seated around the teacher, T, as shown in Fig. 3a. They are seated around the teacher as a single student body; however, we demarcate them into three student sets to facilitate the problem description.

  • Extended Classrooms: consist of classrooms EC − 1, EC − 2 and EC − 3, hosting students SEC− 1, SEC− 2 and SEC− 3 respectively. We assume the teacher, T, is arbitrarily present in EC − 2. These are shown in Fig. 3b, c and d. There exists a dedicated camera capturing the frontal perspective of each physical participant. These videos are displayed in the other classrooms, enabling the virtual presence of distant participants. Thus, in effect, each extended classroom contains all participants, either physically or virtually present. The notation for a camera is Cparticipant and the notation for a display showing a participant is Dparticipant. Note: there can exist one or more displays with the same notation in different classrooms; in such cases they will be specifically qualified by describing the classroom they exist in.

Fig. 3: a Conventional classroom setup. b, c and d Setup in extended classrooms 1, 2 and 3

We represent the gaze of a physical participant by green arrows, and the gaze direction of a virtual participant as perceived by the physical participant by red arrows, as shown in Fig. 4.

Fig. 4: Green arrow indicates the gaze direction of physical participant S1; red arrow indicates the perceived gaze direction of virtual participant S2

Given the setups above, let us consider three interaction scenarios in the conventional classroom and see how the three levels of gaze alignment (mutual gaze, gaze awareness and gaze following) fail when adapted to the extended classroom setup.

  1. Mutual Gaze Scenario: T and S1 are talking to each other, as observed during a typical question-and-answer interaction.

    • Conventional Classroom: T looks at S1 and S1 looks at T, as shown in Fig. 5. This depicts mutual gaze between the interacting participants.

    • Extended Classroom: From Fig. 6a and b: in EC − 1, SEC− 1 looks at the display DT showing T. T’s video is captured by camera CT in EC − 2. In EC − 2, T looks at the display \(D_{S_{EC-1}}\) showing SEC− 1. The video of SEC− 1 is captured by the camera \(C_{S_{EC-1}}\) in EC − 1. In Fig. 6a, DT shows the view SEC− 1 sees of T; in Fig. 6b, \(D_{S_{EC-1}}\) shows the view T sees of SEC− 1.

      From Fig. 6a, it is evident from the view SEC− 1 sees of T on DT that SEC− 1 does not perceive T as gazing towards them. The gaze mismatch is evident; the extended classroom design fails to cater to mutual gaze between the interacting participants.

  2. Gaze Awareness Scenario: S3 initiates a conversation with T, as typically observed during a question-and-answer session.

    • Conventional Classroom: As shown in Fig. 7, all listeners (T, S1 and S2) orient their gaze towards S3, as depicted by green arrows. In addition, T, S1 and S2 can also observe each other looking at S3. S3 also perceives that all other participants are looking at them. Each participant set exhibits gaze awareness, as they know where the neighboring participants are looking.

    • Extended Classroom: From Fig. 7a, b and c:

      • In EC − 3, SEC− 3 looks at the display DT. In EC − 2, T and SEC− 2 both look at the display \(D_{S_{EC-3}}\). In EC − 1, SEC− 1 looks at the display \(D_{S_{EC-3}}\). Their gaze directions are indicated by green arrows. The videos of T, SEC− 1, SEC− 2 and SEC− 3 are captured by the cameras CT, \(C_{S_{EC-1}}\), \(C_{S_{EC-2}}\) and \(C_{S_{EC-3}}\) respectively. The views SEC− 1 sees of T, SEC− 2 and SEC− 3 are shown on the displays DT, \(D_{S_{EC-2}}\) and \(D_{S_{EC-3}}\), and their gaze directions as perceived by SEC− 1 are indicated by green arrows. It is evident that SEC− 1 does not feel that all other participants are looking at SEC− 3. In fact, SEC− 1 is unable to discern where the distant participants are gazing; thus there is no gaze awareness. Similarly, the participants (viewers) in the other classrooms are unable to discern the gaze directions of the distant participants.

      • In addition, in EC − 2, the displays \(D_{S_{EC-1}}\) and \(D_{S_{EC-3}}\) are shared between T and SEC− 2; these displays show the viewpoints T and SEC− 2 see of the distant participants. Owing to the Mona Lisa effect [26], the directions in which T perceives SEC− 1 and SEC− 3 to be looking differ from the directions in which SEC− 2 perceives them to be looking when both view the displays \(D_{S_{EC-1}}\) and \(D_{S_{EC-3}}\), as indicated by red arrows. This further illustrates the lack of gaze awareness.

    In conclusion, each participant wonders where the distant participants are looking; thus the system does not support gaze awareness.

  3. Gaze Following Scenario: T is lecturing and decides to direct his/her attention towards S2 by pointing at S2.

    • Conventional Classroom: S1, S2 and S3 initially look at T (the speaker), who is lecturing, as indicated by the arrows in Fig. 8 (left). When T directs his/her attention towards S2, S1 and S3 follow the gaze direction of T and look towards S2. The final gaze directions are indicated by the arrows in Fig. 8 (right).

    • Extended Classroom: During a lecture, when T is speaking, all participants gaze at T or at the display DT in their respective classrooms. However, when T directs his/her attention towards SEC− 2 by pointing a finger, the other participants are often unable to discern where to look until they locate the ‘entity-at-focus’ by other contextual means. The system fails to cater to gaze following. (The diagram for gaze following has not been depicted, as the effect observed is similar to what is shown in Fig. 7.)

Fig. 5: Interacting participants (T and S1) have their gaze directions oriented towards each other, depicting mutual gaze

Fig. 6: a Scenario in EC − 1. b Scenario in EC − 2. SEC− 1 does not perceive T looking at them, depicting lack of gaze awareness

Fig. 7: EC − 3 is talking to T. a, b and c depict the scenarios in EC − 1, EC − 2 and EC − 3, where there is a mismatch in the gaze directions of virtual participants, depicting lack of gaze awareness

Fig. 8: In a conventional classroom, as the speaker T shifts his attention towards participant S2, all participants reorient their gaze directions to follow the speaker

4 Gaze alignment framework

In this section, we present a formal framework to analyze the constituent factors that influence the interplay of gaze directions. We also discuss how to bring gaze directions into congruence, in turn, translating the formalism into real-world applications. The formalism includes

  (i) Entity position framework

  (ii) Vector geometric modeling of gaze

  (iii) Gaze behavior and classroom interactions

  (iv) Modeling capture and display devices

Several existing solutions for gaze alignment work well for a small number of participating extended classrooms. Most of them involve computer-vision-based techniques, such as overlaying a gaze-aligned warped perspective of a participant’s face, or assign fixed stations to participants so that the relative positions of participants across the different locations are conserved. Such solutions often become unintuitive when scaled to accommodate a large number of participants from different locations. Our formalism provides a framework to analyze gaze directions and study their behavior during interactions, thereby enabling us to architect a scalable, context-aware (knowing when to realign gaze directions) solution. The architecture requires a media-rich setup for comprehensive gaze alignment: the system assumes multiple cameras capturing various perspectives of the participants at each location, and a set of displays presenting videos of the distant participants to the physical participants of that classroom. Our framework mathematically models all the factors affecting gaze. For a given interaction pattern, the framework is used to calculate which camera, among the set of cameras around a remote participant, should be paired with a given display in another location. This camera-to-display mapping is done in such a way that the physical participants of each location, viewing the remote participants on displays, see a gaze-aligned perspective of them during each interaction. In a resource-constrained environment, the framework also dictates the optimal placement of the entities to maximize gaze coherence during interactions.

4.1 Entity position framework

Gaze is highly sensitive to the relative positioning of entities across the different classrooms. Here we describe the position framework, starting with the classroom structure and a coordinate system. Any object or participant in the classroom environment, such as a set of students from one location, the teacher, displays, cameras, etc., is defined as an entity. Each entity, depending on its functional requirement, is placed in the classroom environment and has a unique position and orientation. We consider a set of students from an extended classroom at a location collectively as one entity; each set of distant students collectively forms a distant entity. We also describe the notations for these.

Classroom structure

A classroom can be of any shape or size and any arbitrary position can be chosen as the origin; however, for ease of illustration and without loss of generality, we assume the classroom has the physical structure of a cuboid. The classroom has the teacher (either physically or virtually present) on one side and the students seated on the opposite side. The center of the teacher’s-side wall is chosen as the origin, with the z-axis coming out of the wall towards the students, the y-axis pointing upwards towards the ceiling of the classroom and the x-axis towards the right side of the students, following the right-handed coordinate system.

For the ith extended classroom, ECi, where i ∈ {0, …, n − 1} for an n extended classroom system, we use the notation:

  • Length - lECi, Breadth - bECi, Height - hECi

  • Origin - OECi with ‘x, y and z axes’

Position and orientation of an entity in the classroom

Each entity has its own entity axes and a point handle defining its location, called the Entity Reference Point (ERP). When an entity is placed in a classroom environment, the displacement of its ERP from the classroom origin defines the position of the entity. The position is described using a 3-value tuple, (dx, dy, dz). The angular displacement of the entity axes from the classroom axes defines the orientation; this is again a 3-value tuple, (θx, θy, θz). Figure 9 shows the ERP and entity axes of common entities and the location and orientation of a sample entity.
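A minimal sketch of how an entity and its placement could be represented under the above conventions; the class name and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Entity:
    """An entity (participant set, camera, display, ...) placed in a classroom.
    position is the displacement of its ERP from the classroom origin;
    orientation is the angular displacement of its axes from the classroom axes."""
    name: str
    position: Tuple[float, float, float]      # (dx, dy, dz) in metres
    orientation: Tuple[float, float, float]   # (theta_x, theta_y, theta_z) in degrees

# Hypothetical placement of a display in EC-1: 3 m to the left of the origin,
# 1.2 m above the floor, 2 m into the room, rotated 30 degrees about the y-axis.
display = Entity("D_S_EC-2", position=(-3.0, 1.2, 2.0), orientation=(0.0, 30.0, 0.0))
```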

Fig. 9: a, b and c ERP and orientation (entity axes) of common entities. d The positioning of a sample entity (display) in a classroom environment, displaced from the classroom origin

4.2 Vector geometric modeling of gaze

In this section, we give a formal definition of gaze directions using vectors. Formally, we define a gaze vector as a unit vector starting at the ERP of a gaze source and pointing towards the ERP of the gaze target. In our scenario, gaze sources include participants, cameras, teaching aids like the whiteboard and even participants or cameras shown virtually on displays, i.e., anything with eyes or eye-like features that has the capability to visually perceive can form a gaze source. While there are infinitely many gaze directions from a gaze source, we are only interested in two kinds of gaze vectors, defined as follows.

  1. Direct Gaze Vector, \(\mathbf {\overrightarrow {dgv}}\): a gaze vector perpendicular to the plane-of-face of a gaze source. Examples include 1) a participant directly viewing a gaze target with his head oriented completely towards the target, and 2) a camera pointing directly towards a gaze target. Figure 10 illustrates \(\overrightarrow {dgv}\). In this paper, we use dark green arrows to indicate a physical participant’s \(\overrightarrow {dgv}\) and red arrows to indicate a virtual participant’s \(\overrightarrow {dgv}\).

  2. Peripheral Gaze Vector, \(\mathbf {\overrightarrow {pgv}}\): any gaze vector that is not the \(\overrightarrow {dgv}\). There are many \(\overrightarrow {pgv}\)’s for a gaze source, targeted towards other entities such as cameras, displays and physical participants in a classroom. This commonly comes into play when a participant’s head is oriented towards one entity but he/she can still perceive the gaze directions of other entities. In this paper, we use light green arrows to indicate \(\overrightarrow {pgv}\)’s. Note that we define \(\overrightarrow {pgv}\)’s not only for participants but also for cameras. Figure 10 illustrates \(\overrightarrow {pgv}\).

Fig. 10: S1’s \(\overrightarrow {dgv}\) (dark green arrow) oriented towards T, and S1’s \(\overrightarrow {pgv}\)’s (light green arrows) oriented towards other entities, viz. S2 and S3. Another example shows the perceived gaze direction, \(\overrightarrow {dgv}\) (red arrow), of a virtual participant, S2

We use the notation \(\overrightarrow {Source \rightarrow Target}_{type}\) for representing a gaze vector, where type is either \(\overrightarrow {dgv}\) or \(\overrightarrow {pgv}\). Given the position of the gaze source, (xs, ys, zs), and that of the target, (xt, yt, zt), we can calculate the gaze vector as

$$\frac{(x_{t}-x_{s})\hat{i}+(y_{t}-y_{s})\hat{j}+(z_{t}-z_{s})\hat{k}}{\sqrt{(x_{t}-x_{s})^{2}+(y_{t}-y_{s})^{2}+(z_{t}-z_{s})^{2}}}$$

where \(\hat {i}\), \(\hat {j}\) and \(\hat {k}\) are unit vectors along x, y and z axes.
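The formula above translates directly into a short routine; the sample coordinates are illustrative.

```python
import numpy as np

def gaze_vector(source_erp, target_erp):
    """Unit gaze vector from the ERP of a gaze source to the ERP of a gaze target."""
    s = np.asarray(source_erp, dtype=float)
    t = np.asarray(target_erp, dtype=float)
    d = t - s
    return d / np.linalg.norm(d)

# Example: a student at (1, 1.2, 4) gazing at a teacher at (0, 1.6, 0.5).
print(gaze_vector((1.0, 1.2, 4.0), (0.0, 1.6, 0.5)))
```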

Based on the gaze source, gaze vectors can be also classified as real or virtual.

  1. Real Gaze Vector: a gaze vector whose gaze source is real, i.e., physically present in the classroom. For example, when a physically present participant or camera views another entity, a real gaze vector emanates from the viewer towards the viewed entity. We use green arrows to indicate these.

  2. Virtual Gaze Vector: a gaze vector whose gaze source is a virtual entity, for example a distant teacher on a display seemingly looking at another entity. We use red arrows to indicate these.

4.3 Modeling gaze alignment with gaze vectors

In addition to the gaze vectors of entities in the extended classrooms, we also define an entity-at-focus, which is the primary entity on which the entire classroom environment is actively focused. The entity-at-focus could be a participant or any object that aids the lecture session, such as a whiteboard, presentation screen or 3D model. The three levels of gaze alignment can be mathematically modeled using gaze vectors as follows (a minimal vector check is sketched after Fig. 11).

  1. Mutual gaze: the interacting participants should face each other, with their \(\overrightarrow {dgv}\)’s collinear and in opposite directions.

  2. Gaze awareness: each participant is aware that the other participants are focusing on the same entity-at-focus. To emulate this, the \(\overrightarrow {dgv}\)’s of all participants in all extended classrooms should be concurrent on the entity-at-focus at that point in time.

  3. Gaze following: when the speaker gestures and shifts the attention of the listeners from himself/herself to another entity-at-focus, the \(\overrightarrow {dgv}\)’s of all participants, which were previously concurrent on the speaker, should be reoriented so that they are concurrent on the new entity-at-focus.

These are shown in Fig. 11.

Fig. 11: Depicts mutual gaze, gaze awareness and gaze following
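The three conditions can be checked numerically with simple dot products, as in the following sketch; the angular tolerances are illustrative assumptions rather than values prescribed by the framework.

```python
import numpy as np

def is_mutual_gaze(dgv_a, dgv_b, tol_deg=5.0):
    """Mutual gaze: the two dgv's are (approximately) collinear and opposite."""
    cos_opposite = np.clip(-np.dot(dgv_a, dgv_b), -1.0, 1.0)
    return np.degrees(np.arccos(cos_opposite)) <= tol_deg

def is_concurrent_on_focus(dgvs, source_positions, focus_position, tol_deg=10.0):
    """Gaze awareness / gaze following: every participant's dgv points at the
    entity-at-focus within the tolerance."""
    focus = np.asarray(focus_position, dtype=float)
    for dgv, src in zip(dgvs, source_positions):
        to_focus = focus - np.asarray(src, dtype=float)
        to_focus /= np.linalg.norm(to_focus)
        angle = np.degrees(np.arccos(np.clip(np.dot(dgv, to_focus), -1.0, 1.0)))
        if angle > tol_deg:
            return False
    return True

# Two participants facing each other along the x-axis exhibit mutual gaze.
print(is_mutual_gaze(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))   # True
```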

4.4 Gaze behavior and classroom interactions

We now study the behavioral patterns of gaze and see how these patterns translate into interaction states. Work by Marks [17] categorizes classroom interactions into the following states:

  1. Lecturing State - LS

    • It is a one-to-many interaction where teacher - T takes the role of the speaker and other participants become listeners.

    • T’s \(\overrightarrow {dgv}\) is predominantly focused towards the center of the student body.

    • The \(\overrightarrow {dgv}\) of all students are concurrent on the teacher or the display showing the teacher.

  2. Question and Answer State - QnAS

    • It is a one-to-one interaction where T interacts with a single participant - P (target listener).

    • T and P have their \(\overrightarrow {dgv}\) collinear and in opposite directions.

    • Other listeners’ \(\overrightarrow {dgv}\) are oriented towards the speaker at that point in time.

  3. Discussion State - DS

    • In this state, the participants interact amongst themselves. We assume a certain level of orderliness wherein only one participant speaks at a point in time. This interaction can be either a one-to-one or a one-to-many type.

    • If it is a one-to-one interaction, interacting participant’s \(\overrightarrow {dgv}\)’s are collinear and in opposite directions. The other passive listeners have their \(\overrightarrow {dgv}\)’s directed at the speaking participants.

    • If it is a one-to-many interaction, all listeners’ \(\overrightarrow {dgv}\)’s are concurrent on the speaker, and the speaker’s \(\overrightarrow {dgv}\) oscillates, spanning the listeners.

Note: Participants’ gaze directions often switch between a learning aid and the source of information (teacher, teaching aid, whiteboard or presentation screen). For instance, while taking notes, the participants oscillate their gaze between the teacher/board and their notebooks/laptops. This behavior forms a part of the interaction process and does not contribute to gaze misalignment. In such scenarios, the gaze direction is assumed to be towards the source of information. A minimal sketch of these interaction states follows.
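The sketch below encodes the states above and the entity each participant’s \(\overrightarrow {dgv}\) is expected to target; the simplified rules and participant labels are illustrative assumptions.

```python
from enum import Enum, auto

class InteractionState(Enum):
    LECTURING = auto()        # LS: one-to-many, T speaks, students listen
    QUESTION_ANSWER = auto()  # QnAS: one-to-one between T and a target listener
    DISCUSSION = auto()       # DS: participants interact amongst themselves

def expected_dgv_target(state, participant, speaker, target_listener=None):
    """Entity a participant's dgv is expected to be oriented towards in a state.
    'T' denotes the teacher; other strings denote student entities."""
    if state is InteractionState.LECTURING:
        return "student body" if participant == "T" else "T"
    if state is InteractionState.QUESTION_ANSWER:
        if participant == "T":
            return target_listener
        if participant == target_listener:
            return "T"
        return speaker            # other listeners watch whoever is speaking
    # DISCUSSION: passive listeners orient towards the current speaker.
    return "listeners" if participant == speaker else speaker

print(expected_dgv_target(InteractionState.QUESTION_ANSWER, "S_EC-1",
                          speaker="T", target_listener="S_EC-3"))   # -> 'T'
```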

4.5 Modeling capture and display devices

Cameras and displays often cause changes in gaze directions during capture and display. For instance, many cameras laterally invert the image after capture, and displays introduce the Mona Lisa effect. We can model these effects as transfer functions, F.

$$\overrightarrow{\text{Output Gaze Vector}} = F \cdot \overrightarrow{\text{Input Gaze Vector}}$$

A transfer function simply models the properties of a device that cause changes in gaze vectors. For instance, if we were to capture the video of a participant on a camera that introduces lateral inversion and play it back on a flat-screen display, the net perceived output gaze vector is

$$\overrightarrow{\text{Output Gaze Vector}} = F_{display} \cdot F_{camera} \cdot \overrightarrow{\text{Input Gaze Vector}}$$

In the above example, we used a camera with lateral inversion; if a camera does not introduce any such changes, its transfer function is taken to be the identity. In our scenario, we know the desired effect we want to obtain; for example, if the teacher is speaking, the video of a remote participant should look in the direction of the teacher as perceived by the physical participant. By modelling the properties of cameras and displays as transfer functions, we can position them appropriately to produce the desired effect. Taking this further, if we do not have a camera at a required position, we may be able to laterally invert the video frames captured by another camera to obtain the desired effect; in this case, the computer system capturing the video from this camera and laterally inverting the frames is associated with a transfer function, rather than the camera itself.
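A small sketch of this composition in the x–z plane; the example display rotation is an arbitrary illustrative value, not a property of any particular device.

```python
import numpy as np

def compose(*fs):
    """Compose 2x2 transfer functions; the right-most one is applied first."""
    out = np.eye(2)
    for f in fs:
        out = out @ f
    return out

F_camera = np.eye(2)                      # ideal camera: identity
phi = np.radians(20.0)                    # illustrative display-induced rotation
F_display = np.array([[np.cos(phi),  np.sin(phi)],
                      [-np.sin(phi), np.cos(phi)]])

input_gaze = np.array([0.0, 1.0])         # gaze along +z in the x-z plane
output_gaze = compose(F_display, F_camera) @ input_gaze
print(output_gaze)
```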

We also show how the transfer functions translate into real-world applications dictating the positions of cameras and displays. Since all the entities of concern in the classroom are on the floor, positioned roughly at the eye level of either a sitting or a standing participant, we only consider variations along the xz plane. We use the letter F to denote a transfer function, as follows.

Camera transfer function

Cameras may produce modifications such as lateral inversion, fish-eye distortion, pin-cushion distortion, etc., which cause changes to the gaze vectors. An ideal camera produces negligible amounts of the above-mentioned effects and its transfer function can be assumed to be the identity, given by

$$F_{camera}= \left[\begin{array}{ll} 1 & 0 \\ 0 & 1 \end{array}\right] $$

Certain cameras laterally invert the image as it is being captured, which causes the gaze vectors perceived when viewing the captured video on a display to be laterally inverted. Let us now derive the camera transfer function for lateral inversion. Consider a participant, P, looking at an object, O. The video of the participant is captured by a camera, C, which is oriented towards P as shown in Fig. 12.

  1. Vector \(\overrightarrow {{C \rightarrow P}_{dgv}}\) is a direct gaze vector from C to P.

  2. Vector \(\overrightarrow {{P \rightarrow O}_{dgv}}\) is a direct gaze vector from P to O.

  3. \(\theta =\measuredangle {(-\overrightarrow {C \rightarrow P_{dgv}}, \overrightarrow {P \rightarrow O_{dgv}})}\) (as shown in Fig. 12) \(=\cos ^{-1}\left (\frac {(-\overrightarrow {C \rightarrow P_{dgv}}. \overrightarrow {P \rightarrow O_{dgv}})} {\left |-\overrightarrow {C \rightarrow P_{dgv}}\right |\left |\overrightarrow {P \rightarrow O_{dgv}}\right |}\right )\)

Since \(\overrightarrow {C\rightarrow P_{dgv}}\) and \(\overrightarrow {P \rightarrow O_{dgv}}\) are unit vectors, their modulus = 1.

Thus,

$$ \theta = \cos^{-1}\left( -\overrightarrow{C\rightarrow P_{dgv}}.\overrightarrow{P \rightarrow O_{dgv}}\right) $$
(a)
Fig. 12: P looking at O is captured by a camera, CP, and the image is laterally inverted, giving the perception that P is viewing \(O^{\prime }\)

When the image is laterally inverted, the participant’s gaze direction is reflected about \(-\overrightarrow {C \rightarrow P_{dgv}}\), and the image appears as if P is viewing \(O^{\prime }\) with the gaze vector \(\overrightarrow {{P \rightarrow O^{\prime }}_{dgv}}\), where \(\measuredangle {\left (\overrightarrow {{P \rightarrow O^{\prime }}_{dgv}}, -\overrightarrow {{C \rightarrow P}_{dgv}}\right )}=\theta \). Thus, in effect, \(\overrightarrow {{P \rightarrow O}_{dgv}}\) is transformed into \(\overrightarrow {{P \rightarrow O^{\prime }}_{dgv}}\) by a rotation of − 2θ degrees. This is shown in Fig. 12.

From equation (a),

$$\measuredangle \left( \overrightarrow{{P \rightarrow O^{\prime}}_{dgv}}, \overrightarrow{{P \rightarrow O}_{dgv}}\right) = -2\theta = -2\left( \cos^{-1}{ \left( -\overrightarrow{{C \rightarrow P}_{dgv}}. \overrightarrow{{P \rightarrow O}_{dgv}}\right)}\right) $$

To rotate a vector by an angle, − 2θ, the basic 2D rotation transform is described by the camera transformation matrix where, \(\theta = cos^{-1}\left (-\overrightarrow {{C \rightarrow P}_{dgv}}. \overrightarrow {{P \rightarrow O}_{dgv}}\right )\).

$$ F_{camera}= \left[\begin{array}{ll} \cos(-2\theta) & -\sin(-2\theta) \\ \sin(-2\theta) & \cos(-2\theta) \end{array}\right] = \left[\begin{array}{ll} \cos 2\theta & \sin 2\theta \\ -\sin 2\theta & \cos 2\theta \end{array}\right] $$
(1)
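A sketch of Eq. (1) in the x–z plane; the input gaze vectors are assumed to be unit vectors, and the sample values are illustrative.

```python
import numpy as np

def camera_transfer_function(c_to_p_dgv, p_to_o_dgv, laterally_inverting=True):
    """Eq. (1): identity for an ideal camera; a rotation by -2*theta for a
    laterally inverting one, where theta is the angle between -C->P and P->O."""
    if not laterally_inverting:
        return np.eye(2)
    cos_t = np.clip(np.dot(-np.asarray(c_to_p_dgv), np.asarray(p_to_o_dgv)), -1.0, 1.0)
    theta = np.arccos(cos_t)
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[c, s], [-s, c]])

# Camera looks along +z at P; P looks 30 degrees off the camera axis.
c_to_p = np.array([0.0, 1.0])
p_to_o = np.array([np.sin(np.radians(30)), -np.cos(np.radians(30))])
F_cam = camera_transfer_function(c_to_p, p_to_o)
print(F_cam @ p_to_o)   # gaze vector perceived from the laterally inverted image
```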

Display transfer function

Displays exhibit a unique property where the angle between the virtual participant’s gaze vector (virtual gaze vector) on display and the viewer’s gaze vector (real gaze vector) remains constant even if the viewer moves. This is known as the Mona Lisa effect.

  • To derive the display transfer function, let us consider two classrooms, EC − 1 and EC − 2, hosting participants PEC− 1 and PEC− 2 respectively. PEC− 2 is viewing an object and is captured by camera \(C_{P_{EC-2}}\). PEC− 1 sees the video of PEC− 2 on a display, \(D_{P_{EC-2}}\). This is shown in Fig. 13. Our goal is to determine the gaze vector of PEC− 2, \(\overrightarrow {{P_{EC-2} \rightarrow *}_{dgv}}\), as perceived by PEC− 1 viewing \(D_{P_{EC-2}}\).

  • We know, in EC − 1, \(\overrightarrow {{P_{EC-1} \rightarrow P_{EC-2}}_{dgv}}\), and, in EC − 2, \(\overrightarrow {{C_{P_{EC-2}} \rightarrow P_{EC-2}}_{dgv}}\) and \(\overrightarrow {{P_{EC-2} \rightarrow {object}}_{dgv}}\).

  • \(\measuredangle \left (-\overrightarrow {C_{P_{EC-2}} \rightarrow P_{EC-2_{dgv}}}, \overrightarrow {{P_{EC-2}\rightarrow object}_{dgv}}\right ) = \theta \)

  • \(\theta = \cos ^{-1}\left (\frac {-\overrightarrow {{C_{P_{EC-2}} \rightarrow P_{EC-2}}_{dgv}}. \overrightarrow {{{P_{EC-2}} \rightarrow object}_{dgv}}} {\left | \overrightarrow {{C_{P_{EC-2}} \rightarrow {P_{EC-2}}}_{dgv}} \right | \left | \overrightarrow {{P_{EC-2} \rightarrow object}_{dgv}} \right | }\right )\)\(=\cos ^{-1}\left (-\overrightarrow {{C_{P_{EC-2}} \rightarrow P_{EC-2}}_{dgv}}. \overrightarrow {{P_{EC-2} \rightarrow object}_{dgv}} \right )\), (as gaze vectors are unit vectors)

Fig. 13: Mona Lisa effect: PEC− 1 in EC − 1 is viewing PEC− 2 on display \(D_{P_{EC-2}}\). PEC− 2 in EC − 2, looking at object O, is captured by camera \(C_{P_{EC-2}}\)

To find \(\overrightarrow {{P_{EC-2} \rightarrow *}_{dgv}}\), we rotate \(-\overrightarrow {P_{EC-1} \rightarrow {P_{EC-2}}_{dgv}}\) by − θ degrees (−ve for clockwise rotation). The rotation transform is given by

$$F_{display} = \left[\begin{array}{ll} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{array}\right] = \left[\begin{array}{ll} \cos \theta & \sin \theta \\ -\sin \theta & \cos \theta \end{array}\right] $$
(2)
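A companion sketch of Eq. (2) and of the perceived-gaze computation described above; the inputs are assumed to be unit vectors in the x–z plane.

```python
import numpy as np

def display_transfer_function(c_to_p_dgv, p_to_object_dgv):
    """Eq. (2): a rotation by -theta, where theta is the angle between the
    reversed camera axis (-C->P) and the captured participant's gaze (P->object)."""
    cos_t = np.clip(np.dot(-np.asarray(c_to_p_dgv), np.asarray(p_to_object_dgv)), -1.0, 1.0)
    theta = np.arccos(cos_t)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def perceived_gaze(viewer_to_display_dgv, F_display):
    """Gaze of the on-screen participant as perceived by the viewer: the reversed
    viewer->display vector rotated by the display transfer function."""
    return F_display @ (-np.asarray(viewer_to_display_dgv))
```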

4.6 Applications of transfer functions

Our solution relies on providing the appropriate perspective of distant participants to physical participants for every interaction scenario. In this process, it often becomes necessary to position a movable camera or to pick the appropriate camera from an array of cameras focused on a distant participant. In this section, we discuss two scenarios as follows,

Scenario 1: Let us assume three classrooms, EC − 1, EC − 2 and EC − 3, consisting of participants PEC− 1, PEC− 2 and PEC− 3 respectively. We take a sample interaction in which PEC− 3 (the speaker) is talking to PEC− 2. Let us observe the gaze directions in EC − 1 and EC − 2 as shown in Fig. 14. In EC − 2, PEC− 2 is looking at \(D_{P_{EC-3}}\) (light green arrow). PEC− 2’s video is captured by an ideal camera, \(C_{P_{EC-2}}\), mounted on a semicircular rail enabling the capture of all frontal perspectives. In EC − 1, PEC− 1 is looking at the speaker’s display, \(D_{P_{EC-3}}\), with his/her \(\overrightarrow {dgv}\) (dark green arrow). PEC− 1 can perceive the gaze direction of PEC− 2 on \(D_{P_{EC-2}}\) through a \(\overrightarrow {pgv}\) (light green arrow). Problem: Given this interaction, where should the camera \(\mathbf {C_{P_{EC-2}}}\) be positioned on the rail in EC − 2 such that, in EC − 1, PEC− 2 on \(\mathbf {D_{P_{EC-2}}}\) appears to look at \(\mathbf {D_{P_{EC-3}}}\) (red arrow denoting the virtual gaze vector)?

Fig. 14: Position \(C_{P_{EC-2}}\) in EC − 2 such that PEC− 1 in EC − 1 observes PEC− 2 on \(D_{P_{EC-2}}\) viewing \(D_{P_{EC-3}}\)

Let us calculate the perspective of PEC− 2 (the \(\overrightarrow {dgv}\) of the camera \(C_{P_{EC-2}}\)) to be displayed on \(D_{P_{EC-2}}\) in EC − 1.

Given:

  1. Since PEC− 3 is talking to PEC− 2, in EC − 2, PEC− 2’s \(\overrightarrow {dgv}\) is oriented towards \(D_{P_{EC-3}}\).

  2. PEC− 1 in EC − 1 observes PEC− 2 on the display \(D_{P_{EC-2}}\) with his/her \(\overrightarrow {pgv}\).

  3. PEC− 1 in EC − 1 should observe PEC− 2’s virtual gaze vector oriented towards \(D_{P_{EC-3}}\).

Let us calculate the camera position of \(C_{P_{EC-2}}\):

$$\measuredangle{\left( \overrightarrow{{D_{P_{EC-2}} \rightarrow D_{P_{EC-3}}}_{dgv}}, -\overrightarrow{{P_{EC-1} \rightarrow D_{P_{EC-2}}}_{pgv}}\right)} = \theta$$

We rotate \(\overrightarrow {{P_{EC-2} \rightarrow D_{P_{EC-3}}}_{dgv}}\) by an angle − θ and multiply the result by −1 to invert it; we thus obtain the required gaze vector of the camera.

$$ \overrightarrow{dgv}\ \text{of}\ C_{P_{EC-2}} = \left[ -1 \cdot \left[\begin{array}{ll} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{array}\right] \right] \cdot \overrightarrow{{P_{EC-2} \rightarrow D_{P_{EC-3}}}_{dgv}} $$
(3)

Incidentally, the function in the square brackets is the corollary of the display transfer function. Let us substitute actual values into the above scenario, reflecting the entity positions in meters. As explained at the start of Section 4.5, we neglect variations along the y-axis; Table 1 enumerates the positions of the entities present in EC − 1 and EC − 2.

Table 1 Positions of all entities in EC − 1 and EC − 2

Let us calculate the gaze vectors viz.,

  1. \(\overrightarrow {{P_{EC-1} \rightarrow D_{P_{EC-2}}}_{pgv}} = \frac {\overrightarrow {{(-3-0)\hat {i}+(2-5)\hat {k}}_{pgv}}}{\sqrt {(-3-0)^{2}+(2-5)^{2}}} = \overrightarrow {{-\frac {1}{\sqrt {2}}\hat {i} -\frac {1}{\sqrt {2}}\hat {k}}_{pgv}}\)

  2. \(\overrightarrow {{D_{P_{EC-2}} \rightarrow D_{P_{EC-3}}}_{dgv}} = \frac {\overrightarrow {{(3 + 3)\hat {i}+(2-2)\hat {k}}_{dgv}}}{\sqrt {(3 + 3)^{2}+(2-2)^{2}}} = \overrightarrow {\hat{i}_{dgv}}\)

  3. \(\overrightarrow {{P_{EC-2} \rightarrow D_{P_{EC-3}}}_{dgv}} = \frac {\overrightarrow {{(1-0)\hat {i}+(1-5)\hat {k}}_{dgv}}}{\sqrt {(1-0)^{2}+(1-5)^{2}}} = \overrightarrow {{\frac{1}{\sqrt{17}}\hat{i} - \frac{4}{\sqrt{17}}\hat{k}}_{dgv}}\)

From EC − 1,

$$\theta = \cos^{-1}{\frac{-\overrightarrow{{P_{EC-1} \rightarrow D_{P_{EC-2}}}_{pgv}} \cdot \overrightarrow{{D_{P_{EC-2}} \rightarrow D_{P_{EC-3}}}_{dgv}}}{\left| \overrightarrow{{P_{EC-1} \rightarrow D_{P_{EC-2}}}_{pgv}} \right| \left| \overrightarrow{{D_{P_{EC-2}} \rightarrow D_{P_{EC-3}}}_{dgv}} \right|}} = 45^{\circ}$$

In EC − 2, we rotate \(\overrightarrow {{P_{EC-2} \rightarrow D_{P_{EC-3}}}_{dgv}}\) by − θ (the −ve sign denotes rotation in the clockwise direction),

$$\overrightarrow{dgv}\ \text{of}\ C_{P_{EC-2}} = \left[ -1 \cdot \left[\begin{array}{ll} \cos(-45^{\circ}) & \sin(-45^{\circ}) \\ -\sin(-45^{\circ}) & \cos(-45^{\circ}) \end{array}\right] \right] \cdot \overrightarrow{{\frac{1}{\sqrt{17}} \hat{i}-\frac{4}{\sqrt{17}}\hat{k}}_{dgv}}$$
$$=\overrightarrow{{{-\frac{5}{\sqrt{34}}}\hat{i}+\frac{3}{\sqrt{34}}\hat{k}}_{dgv}}$$

Thus the camera has to be placed at an angle of \(\tan ^{-1}\left (-\frac {3/\sqrt {34}}{5/\sqrt {34}}\right ) \approx 31^{\circ}\).
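The arithmetic of Scenario 1 can be checked numerically as follows; the (x, z) positions are those implied by the gaze-vector calculations above (Table 1 is not reproduced here), and the final angle is measured from the −x axis towards +z, which is our reading of the convention used in the text.

```python
import numpy as np

def unit(a, b):
    v = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    return v / np.linalg.norm(v)

# (x, z) positions in metres implied by the calculations above.
P_EC1, D_PEC2_EC1, D_PEC3_EC1 = (0, 5), (-3, 2), (3, 2)   # entities in EC-1
P_EC2, D_PEC3_EC2 = (0, 5), (1, 1)                        # entities in EC-2

pgv_viewer = unit(P_EC1, D_PEC2_EC1)         # P_EC-1 -> D_P_EC-2
dgv_display = unit(D_PEC2_EC1, D_PEC3_EC1)   # D_P_EC-2 -> D_P_EC-3
dgv_subject = unit(P_EC2, D_PEC3_EC2)        # P_EC-2 -> D_P_EC-3

theta = np.arccos(np.clip(np.dot(-pgv_viewer, dgv_display), -1.0, 1.0))  # 45 deg
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
camera_dgv = -1 * (R @ dgv_subject)                                      # Eq. (3)
placement = np.degrees(np.arctan2(camera_dgv[1], -camera_dgv[0]))
print(np.degrees(theta), camera_dgv, placement)   # 45, (-5, 3)/sqrt(34), ~31 deg
```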

Scenario 2: Let us add a slight complication to the scenario above by assuming that \(C_{P_{EC-2}}\) is a laterally inverting camera, as shown in Fig. 15. In this case, to obtain the correct camera position, \(C_{P_{EC-2}}\) has to be displaced from \(\overrightarrow {{P_{EC-2} \rightarrow D_{P_{EC-3}}}_{dgv}}\) by θ (rotation in the anticlockwise direction); equivalently, the \(\overrightarrow {dgv}\) of \(C_{P_{EC-2}}\) obtained in Scenario 1 should be rotated by 2θ in the anticlockwise direction.

$$ \overrightarrow{dgv}\ \text{of}\ C_{P_{EC-2}}\ \text{with lateral inversion} = \left[\begin{array}{ll} \cos 2\theta & \sin 2\theta \\ -\sin 2\theta & \cos 2\theta \end{array}\right] \cdot \overrightarrow{dgv}\ \text{of}\ C_{P_{EC-2}} $$
(4)
Fig. 15: Position \(C_{P_{EC-2}}\) in EC − 2 of Scenario 1, assuming it is a laterally inverting camera

Equation (4) depicts the corollary of the camera transfer function. Substituting the actual values from Scenario 1,

$$ \overrightarrow{dgv}\ \text{of}\ C_{P_{EC-2}}\ \text{with lateral inversion} = \left[\begin{array}{ll} \cos 90^{\circ} & \sin 90^{\circ} \\ -\sin 90^{\circ} & \cos 90^{\circ} \end{array}\right] \cdot \overrightarrow{{-\frac{5}{\sqrt{34}}\hat{i} + \frac{3}{\sqrt{34}}\hat{k}}_{dgv}} =\overrightarrow{{\frac{3}{\sqrt{34}}\hat{i} + \frac{5}{\sqrt{34}}\hat{k}}_{dgv}} $$
(5)

The camera thus has to be placed at an angle of \(\tan ^{-1}\left (\frac {-5/\sqrt {34}}{3/\sqrt {34}}\right ) \approx 121^{\circ}\).
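A short continuation for Scenario 2, reusing the Scenario 1 result as fixed values; the placement-angle convention is the same assumed reading as above.

```python
import numpy as np

theta = np.radians(45.0)                              # from Scenario 1
camera_dgv = np.array([-5.0, 3.0]) / np.sqrt(34.0)    # Eq. (3) result

# Eq. (4): a laterally inverting camera needs a further rotation by 2*theta.
c, s = np.cos(2 * theta), np.sin(2 * theta)
inverted_dgv = np.array([[c, s], [-s, c]]) @ camera_dgv   # (3, 5)/sqrt(34)
placement = np.degrees(np.arctan2(inverted_dgv[1], -inverted_dgv[0]))
print(inverted_dgv, placement)                            # ~121 deg
```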

Equations (1) and (2) describe the effect of camera and display devices on gaze vectors, while their corollaries, (3) and (4), let us translate the properties of these devices into the appropriate camera position that produces the desired effect. Thus, in this section, we have seen that, by appropriately positioning media devices, the correct perspective of a distant participant can be displayed to the physical participant for a given interaction. However, it is not practical to use a movable camera on a rail in classroom situations, as the entire displayed view of the classroom appears to rotate whenever the camera moves during an interaction. In such scenarios, we instead use an array of cameras positioned around each participant.

The gaze formalism extends to several applications beyond media positioning. Current technologies enable 3D surface contour segmentation and reconstruction: a participant’s 3D surface can be mapped using techniques such as point clouds and then reconstructed. The gaze formalism dictates the appropriate projection of the 3D reconstructed model for a gaze-aligned interaction.

5 Self adapting gaze alignment architecture-SAGA

In this section, we describe a scalable gaze alignment architecture that adapts on the fly to the different interaction states in a classroom environment, across all levels of gaze alignment. This media-rich setup consists of a system of cameras and displays surrounding each set of participants. The entities are located at predetermined positions in each of the extended classrooms to maximize the interactive experience. Figure 16 shows the positions of all the entities in the teacher’s classroom, EC − 0, as well as in a representative remote classroom among EC − 1, EC − 2, …, EC − (n − 1).

Fig. 16: SAGA n-classroom architecture. a Describes EC − 0. b Describes the setup in EC − k, where k ∈ {1, …, n − 1}

5.1 Architecture description

In EC − 0, as Fig. 16 shows, T and SEC− 0 are provided with separate displays, as they require different perspectives of the same virtual participant owing to the Mona Lisa effect. The distances \(d_{T-D}\) and \(d_{D-D}\) (between displays), \(d_{T-C}\) and \(d_{C-S_{EC-0}}\) (radii of camera positioning), and \(d_{T-S_{EC-0}}\) (between T and SEC− 0) can be chosen based on space constraints and user convenience. We slightly modify the notation of displays to \(D_{distant participant}^{viewer}\), since there are dedicated displays for each remote participant per viewer; and since there is more than one camera shooting the video of a participant, we use the notation \(C_{participant}^{angle}\), where angle describes the orientation of the camera that is shooting the participant.

The architecture consists of an array of cameras surrounding each participant, arranged along a semicircle capturing the frontal perspectives. A camera-to-display mapping for all the interactions is calculated as described in Section 4.6 (Applications of Transfer Functions). With a finite number of cameras and a fixed placement, there may not always exist a camera capturing the required perspective; in such cases, the nearest camera is chosen, which results in a gaze quantization error, Δe. The choice of the number of cameras translates into a cost vs. experience trade-off. The number of cameras and their placement depend on the following: 1) number of cameras per participant vs. experience for a fixed number of classrooms; more cameras yield finer perspectives of distant participants and hence smaller gaze quantization errors; 2) an optimal placement strategy given a fixed number of cameras; calculating the camera-to-display mapping table for all interactions provides a map of where cameras are necessary and which regions can be left void; 3) software-induced lateral inversion can also reduce the number of cameras required, since certain perspectives can be emulated by artificially inverting the video frames captured by cameras shooting other perspectives (in our observation, the lateral inversion of a participant’s perspective does not appear unnatural); 4) participants are more sensitive to mutual-gaze error than to gaze-awareness or gaze-following errors [10]. A detailed description of all the experiments illustrating the above points is beyond the scope of this paper.

Our test setup consists of five cameras, with equiangular displacement, positioned around the front of each participant set. This results in a maximum gaze quantization error of 15.46 degrees (as shown in Table 3 in the Appendix), which was observed to be tolerable in our experiments.
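A minimal sketch of the nearest-camera choice and the resulting quantization error; the camera angles below are an illustrative equiangular layout, not the exact placement behind the 15.46-degree figure (see Table 3 for the actual mapping).

```python
import numpy as np

def nearest_camera(required_angle_deg, camera_angles_deg):
    """Pick the installed camera closest to the required capture angle and
    return its index together with the gaze quantization error (delta_e)."""
    errors = [abs(required_angle_deg - a) for a in camera_angles_deg]
    i = int(np.argmin(errors))
    return i, errors[i]

# Illustrative layout: five equiangular cameras spanning the frontal semicircle.
cameras = [18.0, 54.0, 90.0, 126.0, 162.0]
idx, delta_e = nearest_camera(121.0, cameras)   # required angle from Scenario 2
print(idx, delta_e)                             # camera at 126 deg, delta_e = 5 deg
```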

5.2 Interaction state change triggers

The system changes the perspectives of distant participants based on interaction state change triggers, which can be provided in a variety of ways. A simple button interface depicting the names of the different locations can be placed at the teacher’s station; when a distant participant signals the teacher with a hand-raise gesture indicating intent to interact, the appropriate button on the teacher’s console can be pressed to initiate the interaction along with the appropriate perspective switching. Another mechanism involves simple hand gesture recognition of the teacher (and, if need be, of the participants, for better results): the teacher simply points at the appropriate display for a predetermined threshold duration to trigger switching. During each interaction, the two interacting participants interchangeably take the role of the speaker; as the speaker changes, a perspective switch has to be made, so we monitor microphone activity to switch perspectives dynamically. Gesture triggers augmented with microphone-activity-based switching also afford behavioral pattern analysis and prediction studies for more efficient switching. For example, to start an interaction with the teacher, a student typically raises a hand and calls out “excuse me, I have a question”, and the teacher points at the display showing that student, indicating affirmation to proceed with the interaction. The required duration of the teacher’s pointing gesture can be reduced to enhance the naturalness of the gesture, since the microphone activity already indicates the intent to interact. However, we restrict the scope of this paper to gaze alignment. A minimal trigger-monitoring sketch follows.
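The following sketch outlines one way such triggers could be combined; the hold threshold, the callback interface and the input sources are illustrative assumptions, not the exact mechanism of our implementation.

```python
import time

POINT_HOLD_SECONDS = 2.0   # illustrative threshold

class TriggerMonitor:
    """Raise a perspective-switch event when the teacher points at a display
    long enough, or when microphone activity indicates a speaker change."""

    def __init__(self, on_switch):
        self.on_switch = on_switch          # callback accepting keyword arguments
        self._pointed = None
        self._point_start = 0.0

    def update(self, pointed_display, active_mic):
        """Call periodically with the currently pointed display (or None) and
        the currently active microphone (or None)."""
        now = time.monotonic()
        if pointed_display != self._pointed:
            self._pointed, self._point_start = pointed_display, now
        elif pointed_display and now - self._point_start >= POINT_HOLD_SECONDS:
            self.on_switch(target=pointed_display)   # gesture-initiated switch
            self._point_start = now
        if active_mic:
            self.on_switch(speaker=active_mic)       # speaker change mid-interaction

monitor = TriggerMonitor(lambda **event: print("switch:", event))
monitor.update(pointed_display="D_S_EC-1", active_mic=None)
```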

5.3 Camera-display mapping

Given the setup and the different interaction states, we have to select the cameras such that gaze alignment is preserved. The procedure is a slight generalization of the example presented in Section 4.6, where the solution is now given as a set of equations. The optimal solution, depending on the interaction state, the number of classrooms and the resources available (number of cameras), is shown in Table 3 in Appendix A (camera-display mapping table with gaze quantization errors).

6 Evaluation

6.1 Three classroom implementation testbed

We constructed a three-classroom testbed consisting of an equiangularly displaced five-camera setup surrounding each participant set and a Microsoft Kinect interface for tracking the teacher’s gestures. Figure 17 shows the positions of all the entities in the classroom environment along with the entity distances and positions.

Fig. 17: SAGA implementation of a three-classroom testbed. The numerical values represent distances in meters

Several prior studies have shown that gaze-aligned interaction in a classroom indeed enhances the interaction experience [31]. Here, we show that our system successfully transforms a classroom interaction session from a gaze-insensitive experience to a coherent gaze-aligned one. Table 3 shows the camera-to-display mapping for all the interactions in each of the classrooms. The gaze quantization error, Δe, is obtained by calculating the difference between the actual placement of the camera in the classroom and the placement required for perfect gaze alignment. Figure 18 shows snapshots of our implementation.

Fig. 18: Implementation of SAGA showing SEC− 1 asking a question to T; arrows indicate gaze alignment. a SEC− 0’s view in EC − 0. b T’s view in EC − 0. c SEC− 1’s view in EC − 1. d SEC− 2’s view in EC − 2

6.2 Instrumentation details

The GStreamer-1.0 API with H.264 encoding was used to send the streams from one location to another. We used Dell OptiPlex 9010 machines hosting Intel Core i7 processors with 8 GB of RAM for encoding and decoding. We used Sony 46-inch displays with 1920×1080 resolution, and the cameras used for capture were Sony HDR-PJ380. Homogeneity in the instrumentation ensured that differences in the quality of capture/display did not introduce bias into the evaluation.
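A minimal sender/receiver pair illustrating the kind of GStreamer-1.0 pipelines used; the capture source, host address, port and encoder settings are placeholders and not our deployment values.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Sender: capture, H.264-encode and stream one camera feed over RTP/UDP.
sender = Gst.parse_launch(
    "v4l2src device=/dev/video0 ! videoconvert ! "
    "x264enc tune=zerolatency bitrate=2000 ! rtph264pay ! "
    "udpsink host=192.0.2.10 port=5000"
)

# Receiver: depayload, decode and render the stream on the classroom display.
receiver = Gst.parse_launch(
    "udpsrc port=5000 caps=\"application/x-rtp,encoding-name=H264,payload=96\" ! "
    "rtph264depay ! avdec_h264 ! videoconvert ! autovideosink"
)

sender.set_state(Gst.State.PLAYING)
receiver.set_state(Gst.State.PLAYING)
```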

6.3 Survey scenario

The survey scenario consisted of

  1. A 5-minute session with the goal of measuring the time taken to identify the ‘entity-at-focus’ in an eLearning environment.

  2. A 30-minute session to study the difference in effect when exposed to longer sessions of both classes.

The three-classroom testbed was evaluated with two sets of 30 evaluators who attended an interaction-intensive lecture for a 5-minute session. They were asked to be seated in one of the remote classrooms (chosen arbitrarily, as all remote classrooms are homogeneous) and the video streams of the remote participants were presented to them. There was only one audio sink (audio speaker device) in the classroom, through which the audio streams from the other distant classrooms were played back together; this ensured that the students indeed had to use visual cues to discern directions, rather than directional audio. The first set of 30 participants was exposed to the non-gaze-corrected system first and then to the gaze-corrected one; for the second set, the order of exposure was inverted to ensure counterbalance.

Objective evaluation

During the five-minute session, the evaluators took the role of passive listeners and their videos were recorded. The session portrayed a debate - “Do you think languages such as Python should be introduced first in the bachelors curriculum, or C++?”. The debaters had strong opinions and the session was interaction intensive, with several interaction state changes. Post debate, the evaluators’ videos were analyzed offline for the difference in time (Δi, i for interaction) between the start of speaking by a speaker and the identification of who was speaking, as indicated by the evaluator’s head turn. The Δi’s for the first 10 interaction state changes were recorded for each evaluator using a stopwatch timer. The mean Δi for each evaluator is shown in Fig. 19. The collective mean over all the evaluators is 1.67 seconds for the non-gaze-aligned scenario and 0.77 seconds for the gaze-aligned scenario, i.e., the average time taken to discern the entity-at-focus drops to about 46% of its non-aligned value (a reduction of roughly 54%). We also conducted a paired t-test on the sample; the results indicate that the gaze-corrected system indeed reduces the time taken to discern the ‘entity-at-focus’ in a classroom (p = 9.54 × 10−36 < 0.001). This shows an enhanced fluidity of interactions in a classroom environment.
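For transparency about the analysis procedure (not the data), the following sketch runs the same paired t-test on synthetic per-evaluator times; the numbers are generated around the reported means and do not reproduce the data behind Fig. 19.

```python
import numpy as np
from scipy import stats

# Hypothetical per-evaluator mean identification times (seconds).
rng = np.random.default_rng(0)
t_without = rng.normal(1.67, 0.20, size=60)   # non gaze-aligned
t_with = rng.normal(0.77, 0.15, size=60)      # gaze-aligned

t_stat, p_value = stats.ttest_rel(t_without, t_with)
reduction = 1 - t_with.mean() / t_without.mean()
print(f"mean: {t_without.mean():.2f}s -> {t_with.mean():.2f}s "
      f"({reduction:.0%} reduction), p = {p_value:.2e}")
```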

Fig. 19 Mean time taken in seconds by evaluators to identify the entity-at-focus

Fifteen new evaluators from the first-year master’s class of the Department of Wireless Networks and Applications were asked to attend two 30-minute lectures, one with and one without gaze correction (they were allowed to interact with the remote participants). A notable observation was that, over time, in the non-gaze-corrected scenario the evaluators stopped following the entity-at-focus and tuned in only to the audio of the class (unless there was a dire need to look at the video). In the gaze-corrected scenario, the evaluators’ gaze continued to follow the display containing the entity-at-focus. This further supports the argument that our gaze-correction system eases the cognitive load of discerning the entity-at-focus when attending a gaze-aligned lecture session.

Subjective evaluation

The subjective evaluation was carried out with a set of questionnaires, given in the Appendix. The motive of the questionnaire was to answer the following research questions (RQs).

  • RQ1: How effective was the system?

  • RQ2: Was the system natural to use?

  • RQ3: Did the system introduce any artifacts/distractions?

  • RQ4: Would you prefer taking more classes with the system?

A group of 15 master’s students from the Department of Wireless Sensor Networks and Applications, 12 PhD students from the business school, 2 PhD students from the English department, and 9 PhD students and 6 research associates from across the campus with various allied backgrounds in engineering and science were asked to evaluate the system. With a total of 44 evaluators, some of the threats to validity that we tried to address were

  1. 1.

A non-homogeneous pool of evaluators was used to reduce bias.

  2. 2.

Evaluation was conducted in batches, and evaluators were instructed beforehand not to divulge any information to those yet to undergo the experiment.

  3. 3.

Evaluation was conducted in small batches to ensure that seating position during the evaluation did not disadvantage any evaluator.

  4. 4.

We designed the pre- and post-questionnaires using standard methods and scales [21].

A set of pre-questionnaires and post-questionnaires was used for the evaluation. The pre-questionnaire was designed to understand the background of the evaluators and thereby reduce the risk of bias; it was given to the evaluators prior to the start of the experiment. The pre-questionnaire, along with its results, can be found in the Appendix.

The post-questionnaire was designed to measure the QoE/effectiveness of the system. The results of the post-questionnaire are discussed below, categorized to answer the individual research questions (Figs. 20, 21, 22 and 23).

Fig. 20 RQ1 responses

Fig. 21 RQ2 responses

Fig. 22 RQ3 responses

Fig. 23 RQ4 responses

RQ1: How effective was the system?

  1. a)

Was there a marked difference between the first scenario (non-gaze-aligned) and the second scenario (gaze-aligned)? □ Yes □ Yes, it was certainly better but not really a game changer □ No

  2. b)

“I was able to identify who was speaking to whom just by looking at the video.” □ Yes □ No □ I found both to be equally easy

Results from the two questions indicate that the gaze-aligned system was indeed effective when compared to the non-gaze-aligned system.

RQ2: Was the system natural to use?

  1. a)

Was the gaze-aligned system natural to use? Rate it on a scale of 1 (unnatural) to 5 (natural). (x < 3: not supportive; x ≥ 3: supportive) □ 1 □ 2 □ 3 □ 4 □ 5

  2. b)

    Did you feel the gaze aligned system was better for interaction? □ Yes □ No □ Either way was the same to me

All the scores were 3, 4 or 5, indicating that the system was natural to use.

RQ3: Did the system introduce any artifacts/distractions?

  1. a)

Was the video switching during an interaction change noticeable? Rate it from 1 (horribly noticeable) to 5 (did not really notice it). (x ≤ 3: not supportive; x > 3: supportive) □ 1 □ 2 □ 3 □ 4 □ 5

  2. b)

    Did you find the video switching distracting? □ Yes □ No □ It was a bit distracting but I can live with it

  3. c)

    Would you get used to the switching? □ Yes □ No □ I don’t know □ I did not notice it so there is no question of getting used to

  4. d)

    Did the camera setup make you self conscious, since there were five cameras pointed at you? □ Yes □ No

The results indicate that evaluators noticed that some change in perspective had happened, but over time they tuned it out. After the evaluation, one evaluator commented, “In movies, we don’t notice switching so much. It was something like that”.

RQ4: Would you prefer taking more classes with the gaze aligned system?

  1. a)

    Would you like to have more lectures with the gaze-aligned system? □ Yes □ No

For each of the questions, we conducted a chi-squared test; the results are given in Table 2 (an illustrative computation is sketched after the table).

Table 2 Chi square test results for RQ1 to RQ4
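As an illustration of this analysis, a chi-squared goodness-of-fit test on the response counts of a single yes/no question can be computed as sketched below; the counts and the even-split null hypothesis are placeholders, not the actual survey data or the exact test configuration reported in Table 2.

from scipy import stats

# Placeholder response counts for a yes/no question such as RQ4(a).
observed = [40, 4]                      # [Yes, No]
expected = [sum(observed) / 2] * 2      # null hypothesis: responses split evenly

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")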

The p-values obtained indicate that the ‘gaze aligned’ system gave a significantly enhanced experience compared to the ‘non gaze aligned’ system. The results for RQ1, RQ2 and RQ4 corroborate prior studies [8, 18, 31], indicating that multi-level gaze alignment results in a natural classroom interaction environment. Our current implementation introduces minor jitter during switching, which caused mild annoyance, as indicated by the results for RQ3. This can be rectified by pre-fetching a Group of Pictures (GOP) of the new video feed before discarding the old one; one possible realisation is sketched below.
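The sketch below shows one way the pre-fetch could be realised (our assumption, not the current implementation): the candidate feeds stay connected as decoded branches of a GStreamer input-selector, so a newly selected feed has already received and decoded a full GOP by the time the active pad is switched.

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib

Gst.init(None)

# Two decoded RTP/H.264 branches feed an input-selector; only the active pad
# is rendered, but both branches keep decoding so frames are always ready.
pipeline = Gst.parse_launch(
    "input-selector name=sel ! videoconvert ! autovideosink "
    'udpsrc port=5000 caps="application/x-rtp,media=video,clock-rate=90000,'
    'encoding-name=H264,payload=96" ! rtph264depay ! avdec_h264 ! sel.sink_0 '
    'udpsrc port=5002 caps="application/x-rtp,media=video,clock-rate=90000,'
    'encoding-name=H264,payload=96" ! rtph264depay ! avdec_h264 ! sel.sink_1'
)
selector = pipeline.get_by_name("sel")

def switch_to(pad_name):
    # The newly selected branch has been decoding in the background, so the
    # switch does not wait for a fresh key frame from the network.
    selector.set_property("active-pad", selector.get_static_pad(pad_name))
    return False  # run once

pipeline.set_state(Gst.State.PLAYING)
GLib.timeout_add_seconds(5, switch_to, "sink_1")  # example: switch feeds after 5 s
GLib.MainLoop().run()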

7 Conclusion

In this paper, we developed a formalism to describe gaze analysis and proposed a scalable architecture for gaze alignment. We constructed a three-classroom testbed to evaluate the effectiveness of our architecture. Our experiments showed that the average time taken to identify the entity-at-focus dropped to about 46% of its non-gaze-aligned value. This indicates a significant reduction in user cognitive load, translating into enhanced fluidity in eLearning classroom interactions. The subjective evaluation results concurred.

Though more than 80% of the evaluators described the gaze-aligned system as effective, we believe there is scope for several improvements. The system can be enhanced by using cameras with a smaller form factor, or cameras that are less noticeable, thereby making the participants less self-conscious. The jitter introduced during video switching can also be reduced. We intend to address these issues in future work.