Learning is a fundamental part of human nature. The knowledge acquired from learning new skills helps individuals change their cognition and affect, which are central to human growth and development, and is hoped to be the means to happiness, safety, emancipation, productivity, and societal success. Education, as the set of all planned learning processes and activities, is a “means by which men and women deal critically and creatively with reality and discover how to participate in the transformation of their world” (Freire, 1970).

Despite being so important for the development of an individual, learning is not always easy. In 1978, Vygotsky explained the difficulty of learning by introducing the Zone of Proximal Development (Vygotsky, 1978), which indicates the psychological processes that the learner can reach with the support of knowledgeable guidance. According to Vygotsky, there are certain skills and competencies that the learner can only acquire if given the right support. With the right guidance, each learner can stretch outside the zone of comfort and can experience and learn new skills and concepts. Besides external guidance, internal factors also play a determining role in learning success. These include, for example, the motivation to learn (Pintrich, 1999), the self-determination of an individual (Ryan & Deci, 2000), metacognitive skills such as self-regulation (Winne & Hadwin, 1998; Zimmerman, 2002), and the right set of dispositions (Shum & Crick, 2012), skills, values, and attitudes. Beyond these individual factors, the learning sciences also provide best practices for instructing an individual or a group of learners through evidence-driven instructional models such as 4C/ID (Van Merrienboer, Clark, & De Croock, 2002).

For several decades, educational researchers have worked to understand the “black box” of learning, unveiling the underlying dynamics and factors that lead to successful learning. More recently, the education technology research community has been trying to answer the following question: Is there a place for technology to facilitate learning and teaching?

A Historical Perspective on Education Technologies

The first massive implementation of digital technologies in education dates back to the mid-1980s, with the diffusion of the modern personal computer. American universities started sharing course content in university libraries, implementing so-called Computer-Based Learning. Higher education institutions took advantage of the computer by developing distance courses and primitive forms of e-learning systems. The 1980s constituted a “new spring” for Artificial Intelligence research. The invention of the back-propagation rule, which allowed Artificial Neural Networks to learn complex, nonlinear problems, generated a new wave of enthusiasm. The 1980s were characterized by the surge of Expert Systems, computer programs typically written in LISP that modeled specific portions of knowledge. In the domain of education and training, these systems took the name Intelligent Tutoring Systems (ITSs), adaptive computer programs that aimed at providing rich interaction with the student (Anderson, Boyle, & Reiser, 1985; Yazdani, 1986). The ITSs introduced the idea of the Tutor, an intelligent algorithm able to adapt to individual learner characteristics and to work as an “instructor in the box” (Polson, Richardson, & Soloway, 1988) capable of replacing the human teacher. The AI-ITS vision was both controversial and technically complex to achieve in the 1980s. It did not take off as much as other educational technologies such as e-learning.

In the 1990s, e-learning systems developed further. The computer in education shifted from being a knowledge diffusion system to a platform that encouraged sharing and developing knowledge among groups of learners. E-learning became more popular because it was less ambitious and also applicable to ill-structured subjects beyond mathematics, programming, and the natural sciences. E-learning became a tool that could support computer-supported collaborative learning (Dillenbourg, 1999).

In the 2000s, digital technologies developed rapidly, thanks also to the fast spread of the Internet and the World Wide Web. In education research, the Technology-Enhanced Learning (TEL) community emerged. The initial focus of TEL was on e-learning systems and multimedia educational resources. While these educational contents were previously only accessible via a personal computer, in the late 2000s they became available on portable computing devices such as smartphones, tablets, and laptops. These new technological affordances established the research focus on ubiquitous and mobile learning (Sharples, Arnedillo-Sánchez, Milrad, & Vavoula, 2009), i.e., learning anywhere at any time, without physical or geographical constraints.

In the 2010s, we observed a data shift in education technologies with the rise of the Learning Analytics (LA) research community (Ferguson, 2012). The core idea at the basis of LA research was that learners interacting with computer devices leave behind a considerable number of digital footprints, which can be collected and analyzed for describing the learning progress and helping to optimize it (Greller & Drachsler, 2012). Ten years after LA research was introduced, the field has moved significantly forward by identifying additional fundamental challenges. Despite the vast amount of data collected, there is still confusion about how these data can be harnessed to support learners. One part of LA research aims to foster self-regulated learning by stimulating learners to improve their metacognitive skills through self-reflection and social comparison with peer learners (Winne, 2017). Nevertheless, the common idea of providing learners with LA dashboards to raise their awareness does not naturally lead them to change their behavior and meet their goals (Jivet, Scheffel, Drachsler, & Specht, 2017). LA also deals with challenges such as how to ensure ethics and privacy (Drachsler & Greller, 2016), and how to change and improve learning design with the support of learning analytics and data-driven methods (Schmitz, van Limbeek, Greller, Sloep, & Drachsler, 2017).

Another limitation of LA relates to the data sources used. So far, LA data mostly concern learners interacting with a digital platform (e.g., a Learning Management System) by means of mouse and keyboard. LA research – as well as its predecessors – was born in the glass slab era: the primary learning and productivity tools are mediated by a computer screen, a mouse, and a keyboard. With such tools, there is little space for interactions with physical objects in the physical world. The lack of physical interactions during learning led to a reality drift for learning science. According to the theory of embodied cognition, humans have developed their cognitive abilities together with the use of their bodies, and this is encoded in the human DNA (Shapiro, 2019). For example, the hands are made for grasping physical objects, and the human senses developed for perceiving sound, smell, and light. The limited data sources raise valid questions concerning the understandability and interpretability of the digital footprints analyzed by LA researchers. Trying to derive meaning from limited educational data brings the risk of falling into the street-light effect (Freedman, 2010), the common practice in science of searching for answers only in places that are easy to explore.

To include novel data sources and new forms of interaction, a new research focus has emerged within LA research, coined Multimodal Learning Analytics (MMLA) (Blikstein, 2013). The objective of MMLA is to track learning experiences by collecting data from multiple modalities and bridging complex learning behaviors with learning theories and learning strategies (Worsley, 2014). The multimodal shift is motivated, from a theoretical point of view, by the need for more comprehensive evidence and analysis of learning activities taking place in the physical realm, such as colocated collaborative learning (e.g., Pijeira-Díaz, Drachsler, Kirschner, & Järvelä, 2018), psychomotor skills training (e.g., Di Mitri, Schneider, Specht, & Drachsler, 2019a; Schneider & Blikstein, 2015), and dialogic classroom discussions (e.g., D’mello et al., 2015), which were underrepresented in LA research and other data-driven learning research. In parallel, the multimodal shift is also stimulated by a technological push given by the latest technology developments (Dillenbourg, 2016). Learning researchers are making use of new technological affordances to gather evidence about learning behavior. In recent years, the low cost of sensor devices has made them more affordable. Sensors can be found embedded in smartphones, fitness trackers, wrist-based monitors, and Internet of Things devices, and they provide the possibility to continually measure human behavior. These devices can collect data streams and measure life aspects such as hours and quality of sleep, working and productivity time, food intake, and physiological responses such as heart rate or electrodermal activity. Multimodal sensors can also capture “social signals” – thin slices of interaction that predict and classify physical and nonverbal behavior in group dynamics. Multimodality is a relative novelty in the field of learning. For this reason, we introduce the metaphor of the new land, which encloses the promise – or perhaps the hope – to understand learning and human behavior better.

In the 2020s, a new kind of educational technology is taking off. We introduce this new technology under the name of Multimodal Tutor, a new approach for generating adaptive feedback by capturing multimodal experiences. The Multimodal Tutor capitalizes on the support of multimodal data for understanding learning and human behavior, pushing it to the next level. It proposes a theoretical and methodological approach to deal with the complexity of multimodal data, combining artificial intelligence with human assessment. With this hybrid approach, the Multimodal Tutor carries an advanced promise for learners, making learning more authentic, adaptive, and immersive. We argue the Multimodal Tutor may enable us to move toward a learner-centered and constructionist idea of learning as an active and contextualized process of construction of knowledge (Piaget, 1952). The multimodal approach is learner-centered as it focuses on the entire span of human senses and embodied cognitive abilities. It moves away from the nonnatural interactions introduced by computers or smartphones, and it stimulates interactions with the physical world. In the meantime, it tracks information about the learner’s physiology, behavior, and learning context.

The Multimodal Tutor advocates reuniting two branches of educational technology that have been developing in parallel. The first is Learning Analytics and TEL research, focusing primarily on deriving insights from learning data to support human decision-making. The second is AI-ITS research, which for almost three decades has designed, developed, and tested artificially intelligent systems that model the knowledge of learners and guide them through the learning activities of a domain.

Outline of This Book Chapter

This book chapter reports the insights of eight subsequent studies, which led to the final design, ideation, and technical implementation of one example of a Multimodal Tutor in the field of Cardiopulmonary Resuscitation.

The first study is Learning Pulse (section “Research study: Learning Pulse”), in which we investigated the complexity of using multimodal data for learning, paving the way for the Multimodal Tutor. Learning Pulse empirically uncovered a series of complex dynamics, of both conceptual and methodological nature, that derive from using multimodal data to predict learning performance.

In the literature study From Signals to Knowledge (section “Literature Study on Multimodal Data for Learning”), we explore the concept of multimodality by analyzing existing constructs and by conducting a literature survey. This qualitative research approach leads to the formulation of the Multimodal Learning Analytics Model (MLeAM), a conceptual model which serves as the “Map of Multimodality.” The MLeAM sheds light on the multimodal feedback loop that the Multimodal Tutor is set to accomplish.

While the MLeAM indicates the “way to go,” it does not say “how to get there.” There is, in fact, the need for a better understanding of the problem from a technological standpoint and for the formulation of a possible solution. We describe this in section “Position Paper: The ‘Big Five’ Challenges” with the “Big Five” challenges for the Multimodal Tutor.

The Multimodal Pipeline (section “Position Paper: The Multimodal Pipeline”) proposes a technological framework for the cyclic nature of the MLeAM and addresses the “Big Five” challenges with a technical infrastructure. The Multimodal Pipeline proves to be the most critical part of the Multimodal Tutor research: the multimodal data streams are complex to align, synchronize, and store.

The Multimodal Learning Hub (section “Technical implementation: The Multimodal Learning Hub”) is the first prototype of the Multimodal Pipeline, designed to flexibly track learning experiences using customizable multisensor setups.

In section “Technical implementation: the Visual Inspection Tool”, we decide to focus on one specific, unsolved aspect of the Multimodal Pipeline, the data annotation. From this challenge emerges the idea of creating a Visual Inspection Tool, an application for annotating and inspecting multimodal data streams, which allows researchers to “read between the lines.”

In this phase, we decide to narrow the focus to the specific domain of Cardiopulmonary Resuscitation (CPR) training. In section “Feasibility Study: Detecting CPR Mistakes”, we focus on modeling the CPR domain, in particular on how to detect training mistakes from multimodal data using machine learning techniques. Finally, the CPR Tutor is employed in a field study for feedback generation (section “Research Study: Keep Me in The Loop”), where we report the design, development, and experimental testing of the CPR Tutor.

Main Findings

Research Study: Learning Pulse

The exploratory study Learning Pulse (Di Mitri et al., 2017) aimed at predicting levels of stress, productivity, and flow during self-regulated learning. In the study, we gathered multimodal data from nine participants. The data consisted of (1) physiological data (heart rate and step count) from Fitbit HR wristbands; (2) the software applications used on their laptops, tracked with RescueTime; and (3) environmental information (temperature, humidity, pressure, and geolocation coordinates) obtained through web APIs. Over two weeks, the participants self-reported every working hour via a mobile application, the Activity Rating Tool. The participants’ data were collected in a Learning Record Store using custom Experience API (xAPI) statements. The chosen experimental setup allowed too much task diversity, resulting in an uncontrolled study that negatively influenced the quality of the results. Although the nine participants were PhD students of the same department, throughout the two weeks of data collection they used different laptops and software applications, which were grouped into categories. The collected data were heterogeneous: some attributes, such as step count, exhibited random behavior, while others, such as heart rate, had continuous values. To accommodate both fixed and random effects, we opted for a Linear Mixed Effect Model (LMEM), a multilevel prediction algorithm typically used for time-series forecasting.

Collecting the labels needed for the data annotation was among the biggest challenges of Learning Pulse. The self-perceived levels of stress, productivity, and flow were reported retrospectively by the participants every hour using the Activity Rating Tool. We soon realized that the number of labels was not sufficient for supervised machine learning. For this reason, from each labeled hour we derived 12 labeled intervals of 5 minutes. Furthermore, the data-processing approach, especially the Data Processing Application, was rudimentary. The processing pipeline was tailor-made and neither flexible nor reusable for purposes outside the study. The xAPI format proved to be a bottleneck when used for high-frequency sensor data such as heart rate or step count: storing each heart-rate update as an xAPI statement in a triple store generated a load of redundant information that slowed down the data import and the overall computation. Finally, the poor accuracy of the model did not allow us to explore the feedback mechanisms further.
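The derivation of 12 labeled 5-minute intervals from each hourly self-report can be sketched as follows; the function and the report fields are hypothetical, not the actual Learning Pulse code:

```python
from datetime import datetime, timedelta

def expand_hourly_label(hour_start, label, interval_minutes=5):
    """Split one hourly self-report into equally labeled sub-intervals.

    Returns a list of (interval_start, interval_end, label) tuples.
    The `label` dict mimics an Activity Rating Tool entry (illustrative fields).
    """
    intervals = []
    step = timedelta(minutes=interval_minutes)
    start = hour_start
    while start < hour_start + timedelta(hours=1):
        intervals.append((start, start + step, label))
        start += step
    return intervals

report = {"stress": 2, "productivity": 4, "flow": 3}
chunks = expand_hourly_label(datetime(2016, 5, 9, 14, 0), report)
print(len(chunks))  # 12 intervals of 5 minutes each
```

The obvious caveat, noted in the study, is that the 12 intervals inherit a single retrospective judgment, so the label granularity is nominal rather than real.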


  • Data collection over long periods needs to deal with the task diversity of each user and with uncontrolled setups.

  • Tracking the software applications used by each user leads to diverse sets of attributes per user, which makes users harder to compare.

  • Some modalities are continuous variables (e.g., heart rate), while others are random variables (e.g., step count), which makes them hard to combine and analyze.

  • Fixed-time (e.g., hourly) self-reports are not always reliable and are subject to bias.

  • There is a trade-off between the number of labels needed for supervised machine learning and the time that humans need to annotate the data.

  • Harnessing the potential of multimodal data requires run-time systems, such as data-processing pipelines, instead of data-analysis scripts that run only once.

  • xAPI is not suitable for storing and exchanging high-frequency sensor data due to the high overhead of its verbose JSON format.
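To illustrate the overhead noted in the last point: wrapping a single heart-rate reading in an xAPI-style statement repeats actor, verb, and object boilerplate for every value. The statement below is a rough sketch with illustrative identifiers, not the exact vocabulary used in Learning Pulse:

```python
import json

# One heart-rate reading as an xAPI-like statement (identifiers illustrative).
statement = {
    "actor": {"mbox": "mailto:participant1@example.org", "objectType": "Agent"},
    "verb": {"id": "http://example.org/verbs/measured",
             "display": {"en-US": "measured"}},
    "object": {"id": "http://example.org/activities/heart-rate",
               "objectType": "Activity"},
    "result": {"extensions": {"http://example.org/bpm": 72}},
    "timestamp": "2016-05-09T14:00:00Z",
}
payload = json.dumps(statement)

# ...whereas the measurement itself is a handful of bytes.
value = json.dumps({"t": "2016-05-09T14:00:00Z", "bpm": 72})
overhead = len(payload) / len(value)
print(round(overhead, 1))  # the boilerplate inflates each reading several-fold
```

At one statement per second per sensor, this constant factor multiplies into the import and computation slowdown reported above.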

Literature Study on Multimodal Data for Learning

This literature study (Di Mitri, Schneider, Specht, & Drachsler, 2018a) aimed at mapping the state of the art of multimodal data for learning, a field that was emerging as Multimodal Learning Analytics (MMLA). The exploratory study Learning Pulse (Di Mitri et al., 2017) and the related work done in the field were the main motivations driving this scientific investigation. Surveying the related literature showed that MMLA covered a scattered, not yet coherent scientific field. This work contributed to framing the mission of MMLA: using multimodal data and data-driven techniques to fill the gap between observable learning behavior and learning theories. We coined this mission “from signals to knowledge.” We conducted a literature survey of MMLA studies using the proposed classification framework, in which we separate two main components, the input space and the hypothesis space, divided by the observability line. The literature survey led to the Taxonomy of Multimodal Data for Learning and the Classification Table for the Hypothesis Space. Surveying the related studies allowed us to discover interesting commonalities. For example, most of the studies using multimodal data looked primarily at metacognitive dimensions as hypotheses, such as the presence of specific emotions during learning.

The literature survey led us to propose a new theoretical construct, the Multimodal Learning Analytics Model (MLeAM), a conceptual model for supporting the emerging field of MMLA. MLeAM has three main objectives: (1) mapping the use of multimodal data to enhance feedback in a learning context; (2) showing how to combine machine learning with multimodal data; and (3) aligning the terminology used in the fields of machine learning and learning science.


  • Sensors can capture observable learning dimensions that include behavioral, activity, and contextual data – we refer to this as the input space.

  • The unobservable learning dimensions such as cognitive, metacognitive, or emotional aspects stand below the observability line – we refer to this as the hypothesis space.

  • Using human-driven data annotation and machine learning, it is possible to infer the unobservable from the observable dimensions. This process is described by the Multimodal Learning Analytics Model (MLeAM).

  • MLeAM shows how best to exploit machine learning and multimodal data to support human learning.

  • The work in MMLA is hampered by the fact that it cannot yet rely on standardized approaches and techniques.

  • Further research effort must be put into technical prototypes, standardized technical infrastructure, run-time systems, and common practices for multimodal data for learning.
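As a toy illustration of inferring the unobservable from the observable, the sketch below trains a nearest-centroid rule on human-annotated examples; the features, labels, and classifier are invented for illustration and are not part of MLeAM itself:

```python
# Toy example: inferring an unobservable state (hypothesis space) from
# observable sensor features (input space) via human-annotated examples.
# Features, labels, and the nearest-centroid rule are illustrative only.
annotated = [  # (heart_rate, movement_intensity) -> human annotation
    ((95, 0.8), "stressed"), ((100, 0.9), "stressed"),
    ((65, 0.2), "calm"), ((70, 0.3), "calm"),
]

def centroid(points):
    """Mean point of a list of 2-D feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

centroids = {label: centroid([x for x, y in annotated if y == label])
             for label in {"stressed", "calm"}}

def classify(sample):
    """Map an observable sample to the state with the closest centroid."""
    return min(centroids,
               key=lambda lb: sum((a - b) ** 2
                                  for a, b in zip(sample, centroids[lb])))

print(classify((92, 0.7)))  # stressed
```

The point of the sketch is the direction of the arrow: human annotations provide the hypothesis-space labels, and the model learns to cross the observability line from input-space data alone.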

Position Paper: The “Big Five” Challenges

In the “Big Five,” we address one structural shortcoming of the MMLA field, evidenced by the literature survey conducted in (Di Mitri et al., 2018a): the lack of standardized technical approaches for supporting learning activities with multimodal data. We claimed that this technical gap holds back the development of the MMLA field by forcing MMLA researchers to duplicate efforts in setting up data collection infrastructures and preventing them from focusing on answering data-analysis research questions. In (Di Mitri, Schneider, Specht, & Drachsler, 2018b), the identified technical challenges are grouped into five categories, named the “Big Five” challenges of Multimodal Learning Analytics: (1) data collection, (2) data storing, (3) data annotation, (4) data processing, and (5) data exploitation. The chapter attempts to provide possible solutions to these challenges that are flexible enough to be employed in different contexts.


  • The technical challenges of MMLA can be grouped into five categories: (1) data collection, (2) data storing, (3) data annotation, (4) data processing, and (5) data exploitation.

  • The five challenges represent the steps that need to be addressed for implementing a data-driven feedback loop.

  • Each of the challenge categories presents a set of subchallenges that need to be addressed by MMLA researchers.

  • Tackling all these challenges together is a complicated research effort.
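Read as stages of a single data-driven feedback loop, the “Big Five” can be sketched as a chain of functions; every function body below is a hypothetical placeholder, not an actual implementation:

```python
# Sketch of the "Big Five" as consecutive stages of one feedback loop.
# All data and function bodies are placeholders for illustration.

def collect():            # 1. data collection from sensor applications
    return [{"t": i, "bpm": 70 + i} for i in range(3)]

def store(records):       # 2. data storing in a session document
    return {"session": "demo", "records": records}

def annotate(session):    # 3. data annotation by a human expert
    session["label"] = "on-task"
    return session

def process(session):     # 4. data processing / model training
    bpms = [r["bpm"] for r in session["records"]]
    return {"mean_bpm": sum(bpms) / len(bpms), "label": session["label"]}

def exploit(model):       # 5. data exploitation, e.g., feedback generation
    return f"feedback for {model['label']} learner"

feedback = exploit(process(annotate(store(collect()))))
print(feedback)  # feedback for on-task learner
```

The chain makes the dependency structure explicit: a weakness at any earlier stage (e.g., unsynchronized collection) propagates to everything downstream, which is why the chapter argues the challenges must be tackled together.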

Technical Implementation: The Multimodal Learning Hub

As tackling all five challenges requires a complex effort, we decided to build upon an existing research prototype, a solution for data collection and synchronization and for data storing: the Multimodal Learning Hub (Schneider, Di Mitri, Limbu, & Drachsler, 2018) (LearningHub). The LearningHub is a platform that can collect data from multiple sensor applications and synchronize them into session files. The most significant research outputs of the LearningHub are (1) a software prototype that can connect to multiple sensor applications running on Windows, and (2) the introduction of a new data-storing logic and a custom data format, which we coined Meaningful Learning Task (MLT-JSON).


  • Sensor devices have different software systems, making the integration of data from multiple sources not trivial.

  • Sensors generate data at different frequencies.

  • One sensor stream can be composed of several attributes.

  • A typical problem of sensor fusion is the time synchronization of different devices. This problem can be addressed by having the LearningHub work as a “master” that decides when the sensor applications should begin collecting data.

  • As continuous data collection is complex and expensive to realize, it is easier to adopt a “batch approach,” in which the user decides when to “start” and “stop” the data collection.

  • The MLT-JSON format allows creating a document for each sensor device with multiple attributes and stores the data in a human-readable format.

  • Although MLT-JSON adopts a verbose format (due to repetitive JSON tags), when compressed its file size is reduced by 90–95%.
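The last point can be demonstrated with a small experiment: a repetitive JSON document in the spirit of MLT-JSON (the field names below are illustrative, not the actual schema) compresses to a small fraction of its raw size with standard gzip:

```python
import gzip
import json

# A recording document in the spirit of MLT-JSON: one document per sensor
# application, with repetitive per-frame tags (field names are illustrative).
frames = [{"timestamp": i * 10, "x": i % 8, "y": (i + 3) % 8, "z": 981}
          for i in range(2000)]
document = json.dumps({"application": "DemoSensor", "frames": frames}).encode()

compressed = gzip.compress(document)
ratio = len(compressed) / len(document)
print(f"{(1 - ratio):.0%} size reduction")  # repeated tags compress very well
```

Because the per-frame keys repeat verbatim thousands of times, a dictionary-based compressor removes most of the verbosity, which is why a human-readable format remains practical for storage and exchange.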

Technical Implementation: The Visual Inspection Tool

In (Di Mitri, Schneider, Klemke, Specht, & Drachsler, 2019), we focused on one of the five big challenges: data annotation. This challenge deals with how humans can make sense of complex multidimensional data. In this chapter, we proposed a new technical prototype, the Visual Inspection Tool (VIT). The VIT allows researchers to visually inspect and annotate various psychomotor learning tasks captured with a customizable set of sensors. The file format supported by the VIT is MLT-JSON, meaning that any session recorded with the LearningHub can be loaded, visualized, and annotated using the VIT. The VIT enables the researcher (1) to triangulate multimodal data with video recordings; (2) to segment the multimodal data into time intervals and add annotations to them; and (3) to download the annotated dataset and use the annotations as labels for machine learning predictions. Besides generically addressing data annotation, the VIT also facilitates data processing and exploitation. The VIT is released as open-source software (code available on GitHub).


  • Sensor data are poorly informative when visualized alone; for this reason, they need to be complemented by evidence interpretable by humans, such as video data. Without video, it is not easy to make sense of what happened in the recorded session.

  • The numerical sensor attributes (as opposed to categorical variables) can be visualized as time series. Visualizing more than a couple of time series is tricky for the human eye; manually selecting the attributes to visualize is therefore crucial.

  • Audio and video data can be transformed into numerical time series (e.g., by extracting colors of pixels or audio features) and added to the multimodal dataset.

  • An annotation is a human interpretation of the data that applies to a specific time interval with a begin and an end.

  • Each time interval (annotation) can consist of multiple attributes; this approach allows the optimal definition of binary and nonbinary classes.

  • Manually selecting the time intervals is an expensive task that should be automated if possible – in the best-case scenario, the human role should be limited to supervising, i.e., correcting and integrating the (semi)automatic annotations.
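A minimal sketch of interval-based annotation, assuming a structure similar to (but not necessarily identical with) the VIT’s: each annotation covers a [begin, end) interval and carries multiple attributes that can later serve as machine-learning labels:

```python
# Sketch of interval-based annotation: each annotation spans [begin, end)
# in seconds and carries several attributes (names and values illustrative).
annotations = [
    {"begin": 0.0, "end": 4.5,
     "attributes": {"arms_locked": True, "rate_ok": False}},
    {"begin": 4.5, "end": 9.0,
     "attributes": {"arms_locked": False, "rate_ok": True}},
]

def label_for(timestamp):
    """Return the attribute set of the annotation covering a timestamp."""
    for ann in annotations:
        if ann["begin"] <= timestamp < ann["end"]:
            return ann["attributes"]
    return None  # unannotated region

print(label_for(2.0))  # {'arms_locked': True, 'rate_ok': False}
```

Carrying multiple attributes per interval is what permits both binary classes (a single boolean) and nonbinary class definitions (combinations of attributes) from the same annotation pass.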

Position Paper: The Multimodal Pipeline

The VIT, as well as the LearningHub and its custom data format MLT-JSON, constitutes a chain of reusable technical components, which we coined the Multimodal Pipeline and described in (Di Mitri, Schneider, Specht, & Drachsler, 2019c). The Multimodal Pipeline is an integrated technical workflow that works as a toolkit for supporting MMLA researchers in setting up new studies in various psychomotor learning scenarios. Using components from this toolkit can reduce the development time needed to set up studies, and it can facilitate and speed up the transfer of research knowledge in the MMLA community. The Multimodal Pipeline connects a set of technical solutions to the “Big Five” challenges presented in (Di Mitri, Schneider, Specht, & Drachsler, 2019c). The Multimodal Pipeline has two main stages: the first is the “offline training,” in which the collected sessions are annotated and the ML models are trained with the collected data; the second is the “online exploitation,” which corresponds to the “run-time” behavior of the Multimodal Pipeline.


  • The Multimodal Pipeline describes in technical terms the data-driven feedback cycle proposed by MLeAM in (Di Mitri et al., 2018a).

  • There are two flows of data in the Multimodal Pipeline, the “offline-training” and the “online-exploitation.”

  • The data annotation typically happens before the data processing, as annotations are required for training the models.

  • The data annotation is not always required. The Multimodal Pipeline can serve different exploitation strategies besides predictive feedback using supervised ML; these include rule-based corrective feedback, pattern identification, historical reports, diagnostic analysis, and expert–learner comparison.

  • The Multimodal Pipeline can harness multimodal data for Learning Analytics dashboards, for example, for raising awareness and supporting the orchestration of learning activities; similarly, it can be embedded in Intelligent Tutors to achieve better adaptation and personalization of the tutoring experience.

Feasibility Study: Detecting CPR Mistakes

In (Di Mitri, Schneider, Specht, & Drachsler, 2019a), we selected Cardiopulmonary Resuscitation (CPR) as an application case for the Multimodal Tutor. We selected CPR training as a representative learning task for carrying out a study on mistake detection. CPR was chosen primarily because it is an individual learning task: it is repetitive and highly structured, it has clear performance indicators, and it is training with high social relevance. Among the different specialization options that the Multimodal Tutor could take, we decided to focus on the design of a CPR Tutor. We introduced a new approach for detecting CPR training mistakes from multimodal data using neural networks. The proposed system was composed of a multisensor setup for CPR, consisting of a Kinect camera and a Myo armband. We used the system in combination with the ResusciAnne manikin for collecting data from 11 experts performing CPR training. We first validated the collected multimodal data against three performance indicators provided by the ResusciAnne manikin, observing that we could accurately classify the training mistakes on these three standardized indicators. We further concluded that it is possible to extend the standardized mistake detection to additional performance indicators, such as correct locking of the arms and correct body position. So far, those mistakes could only be detected by human instructors.


  • The quality of the training corpus is crucial for ensuring solid model training. Collecting data and training classifiers with a small number of participants leads to particular models that do not generalize well. The diversity and amount of training data are key.

  • There is no golden number of annotated samples (chest compressions – CCs) that need to be collected; there is, however, a dependency on the number of attributes that will be considered.

  • Given that the samples (CCs) have different durations, it is important to resample them to a fixed number of bins, applying some trimming or padding.

  • Applying normalization and min–max scaling to all attributes is important for achieving the best results; the scaling has to match the activation function used in the neural network.

  • Increasing the number of input attributes (e.g., adding new modalities) increases the classification accuracy of the model; these attributes work as a regularization factor, adding more “background noise” to the model and making it more robust.

  • Neural networks seem robust in accepting heterogeneous input while converging to good results; we decided that we could use the participants’ data as part of the same training set despite individual body differences.

  • It is difficult to capture the span of all possible mistakes with a restricted number of participants; each participant tends to make only a small subset of mistakes. The solution we found was to ask participants to mimic some types of mistakes.

  • The task structure – two sessions of 2 minutes performing CCs – is tiring for the participants.

  • Body size differs among participants, and this affects how the sensors are worn; for instance, people with thinner forearms had some trouble wearing the Myo, which was too loose.
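The resampling and scaling steps mentioned above can be sketched as follows; the bin count, the toy signal, and the pure-Python interpolation are illustrative, not the study’s actual preprocessing code:

```python
# Sketch: bring variable-length chest-compression signals to a fixed number
# of bins via linear interpolation, then min-max scale to [0, 1].

def resample(signal, bins):
    """Linearly interpolate a 1-D signal onto `bins` equally spaced points."""
    n = len(signal)
    out = []
    for b in range(bins):
        pos = b * (n - 1) / (bins - 1)   # fractional index into the signal
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

def min_max(signal):
    """Scale values to [0, 1], a range suited to e.g. a sigmoid activation."""
    lo, hi = min(signal), max(signal)
    return [(v - lo) / (hi - lo) for v in signal]

cc = [0.0, 1.2, 3.4, 2.8, 0.9]       # one compression, 5 raw readings
fixed = min_max(resample(cc, 8))     # 8 bins, values in [0, 1]
print(len(fixed), min(fixed), max(fixed))
```

After this step, every compression occupies the same input shape regardless of its original duration, which is a precondition for feeding batches into a neural network.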

Research Study: Keep Me in The Loop

In (Di Mitri et al., 2021), we presented the design and development of the real-time feedback architecture of the CPR Tutor. To complete the chain of flexible technical solutions proposed by the Multimodal Pipeline, we developed SharpFlow (code available on GitHub), an open-source data-processing tool. SharpFlow supports the MLT-JSON format used also by the VIT and the LearningHub. The data serialized in this format are transformed by SharpFlow into a tensor representation and fed into a Recurrent Neural Network architecture trained to classify the different target classes contained in the annotation files. SharpFlow also implements the two data flows of offline training and online exploitation; it achieves the latter using a TCP server that classifies every new chest compression in real time. In (Di Mitri et al., 2021), the architecture was employed first in an Expert Study involving ten participants, aimed at training the mistake-classification models, and second in a User Study involving ten additional participants, in which the CPR Tutor prompted real-time feedback interventions.
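The online-exploitation data flow can be sketched as a minimal TCP loop: a server receives one serialized chest compression per line and replies with class predictions. The message format and the stub classifier below are assumptions for illustration, not SharpFlow’s actual API:

```python
import json
import socket
import threading

def classify(sample):
    """Stub standing in for the trained recurrent model (illustrative rule)."""
    return {"arms_locked": sample["depth_mm"] > 50}

def serve(server_sock):
    """Accept one client and answer each newline-delimited JSON sample."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("rw") as stream:
        for line in stream:
            prediction = classify(json.loads(line))
            stream.write(json.dumps(prediction) + "\n")
            stream.flush()

server = socket.socket()
server.bind(("127.0.0.1", 0))          # pick any free port
server.listen(1)
threading.Thread(target=serve, args=(server,), daemon=True).start()

client = socket.create_connection(server.getsockname())
with client, client.makefile("rw") as stream:
    stream.write(json.dumps({"depth_mm": 55}) + "\n")
    stream.flush()
    reply = stream.readline().strip()
print(reply)  # {"arms_locked": true}
```

In such a request/response loop, the per-instance classification latency (70 ms in SharpFlow’s case) is what determines whether feedback can keep pace with each chest compression.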


  • Learning from experts is complicated as experts do not make enough mistakes; instances of mistakes are needed to train the machine learning algorithm; in (Di Mitri et al., 2021), we asked the experts to mimic some common mistakes.

  • The amount of training data obtained from 10 experts was limited; while the findings could not be generalized, they provided some indication that the feedback of the CPR Tutor had a positive influence on the CPR performance on the target classes.

  • The proposed architecture used for the CPR Tutor allowed for successful provision of real-time multimodal feedback.

  • The generated feedback seemed to have a short-term positive influence on the CPR performance on the target classes considered.

  • There is a hierarchy among the performance indicators: Some mistakes are less frequent but more critical than others, and they need to be corrected first; some other mistakes are more frequent but not so critical.

  • Imbalanced class distribution is a real problem; there seems to be an amplifying effect: The majority class in the training set tends to prevail even more in the test set and in the classification of new instances.

  • Down-sampling is not trivial; as we had five target classes, down-sampling one class would also affect the other ones; finding a fair balance among the classes was hard.

  • Oversampling is not trivial either with time series; generating synthetic data could distort the prior class distribution.

  • The maximum feedback frequency was set to one intervention per 10-second interval; more frequent feedback would distract or confuse the participant.

  • The feedback messages must be explained to the participants beforehand so that they know what to expect and what each message means, preventing confusion.

  • The SharpFlow online exploitation was swift (70 ms for classifying each instance); in this way, the overall system was not heavily disrupted each time a single CC had to be assessed.

  • For the longer-term influence of the feedback on the target performance indicators, we would need to (1) collect data from more participants; (2) increase the number of sessions per participant; and (3) select participants with less experience so their performance is not optimal and feedback is fired more frequently.
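Given the down-sampling and oversampling difficulties listed above, one common alternative is to leave the data untouched and instead weight the loss function inversely to class frequency, so the minority mistake classes are not drowned out. The helper below is a small illustrative sketch of that idea, not part of SharpFlow.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return a per-class weight inversely proportional to class frequency,
    so the loss of a classifier pays proportionally more attention to rare
    (minority) mistakes without re-sampling the time series themselves."""
    counts = Counter(labels)
    total = len(labels)
    # total / (n_classes * count) yields 1.0 for a perfectly balanced set.
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}
```

Deep learning frameworks typically accept such a dictionary directly as a class-weight argument during training, making this a low-effort mitigation compared to balancing five interdependent target classes by hand.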

The Multimodal Tutor presents a set of advantages for the MMLA community. It builds on top of a newly proposed technological framework, the Multimodal Pipeline, which, in turn, is composed of a chain of technological prototypes: (1) the Multimodal Learning Hub, (2) the Visual Inspection Tool, and (3) SharpFlow. All these tools adopt the same data-exchange format (MLT-JSON) and are released under the Creative Commons ShareAlike 4.0 International license.

The main advantage for MMLA researchers of using such tools is that there is no longer a need to reinvent solutions for data collection, synchronization, storing, annotation, and processing. MMLA researchers can focus on more specific aspects of their experiments, such as deciding which sensor configuration to use, depending on which modalities need to be monitored. Similarly, they can decide which hypotheses to formulate, which unobservable dimensions of learning have to be assessed, and how these dimensions can be translated into an annotation scheme. They can ultimately focus on modeling the learning task: which sets of atomic actions it comprises and which pedagogical and feedback interventions are suitable for correcting or optimizing the performance of each of these actions. The use of the Multimodal Tutor and its underpinning technological framework (Multimodal Pipeline) and conceptual model (Multimodal Learning Analytics Model) provides flexibility and multi-purposeness, pushing forward the entire MMLA field. By explaining how to support learning using multimodal data, the Multimodal Tutor generates scientific added value for different data-driven learning research communities, such as Learning Analytics & Knowledge and Intelligent Tutoring Systems / Artificial Intelligence in Education. Ultimately, the Multimodal Tutor paves the way for more “emerging” fields of research such as Hybrid Intelligence (Kamar, 2016), Social Artificial Intelligence, or Social Robotics (Kanda & Ishiguro, 2017), which are concerned with how to best interface human communication with artificial (robotic) intelligence.


Among many advancements in MMLA research, the Multimodal Tutor still carries some limitations. First and foremost, the Multimodal Tutor still consists of a set of research prototypes that are not ready to be launched on the market as fully working products. Extensive testing, quality checking, and control of the existing functionalities are required to achieve production-ready software. Within the research applications of the Multimodal Tutor, there are additional limitations, which can be divided into different levels: (1) learning domain level, (2) hardware level, (3) software level, (4) data level, and (5) model level.

At the learning domain level, we have been focusing primarily on CPR training, which is a common type of medical simulation. Related research using the components of the Pipeline has been carried out for the Presentation Trainer (Schneider, Börner, van Rosmalen, & Specht, 2015), the Calligraphy Tutor (Limbu, Schneider, Klemke, & Specht, 2018), and the Table Tennis Tutor (Sanusi, Di Mitri, Limbu, & Klemke, 2021). We group all these learning tasks as individual psychomotor learning tasks in the physical space, i.e., practical training tasks where the learner has to individually master skills that require a high level of psychomotor coordination and that take place in the physical realm. For this reason, we intentionally left out learning scenarios such as cognitive learning, i.e., tasks that require more reasoning and cognitive abilities; social learning, i.e., tasks that require interaction by multiple actors and/or by groups; or distance and online learning, including activities mediated by mouse and keyboard. We decided to narrow the focus to make the research contribution of the Multimodal Tutor more evident to the community. At the same time, we believe the boundaries of these scenarios are blurry; therefore, the proposed categorization may run into inconsistencies. As specified in the next section, we firmly believe that in the future, the Multimodal Tutor can evolve to support also different types of learning scenarios outside of its current focus. Modeling the learning task is a fundamental part of assessing how the Multimodal Tutor can be most supportive. Psychomotor learning tasks differ primarily by two factors: (1) their repetitiveness and (2) their structuredness.
Learning how to perform chest compressions during CPR is a highly repetitive learning task, as the learner needs to perform repetitive movements; at the same time, CPR is highly structured, as there are apparent performance indicators that define the characteristics of a good CPR performance. These two characteristics make CPR an ideal application scenario for the Multimodal Tutor. By contrast, calligraphy or foreign alphabet learning in the Calligraphy Tutor consists of repetitive tasks without clear performance indicators. The domain of public speaking of the Presentation Trainer consists of diverse, non-repetitive movements that lack clear performance indicators for assessment.

At the sensor hardware level, the quality of the collected data can significantly influence the quality of model training and thus of the feedback. In the CPR Tutor and related reference application scenarios, we opted for commercial sensor devices in place of custom-made boards. Compared to custom-made boards, sensor devices such as the Microsoft Kinect, Myo Armband, or Fitbit HR have the advantage of being widely tested, providing high-level drivers and APIs for easy connection, and offering broad community support. However, commercial devices have known limitations in terms of precision. In this research, we realized that the choice of the sensor setup should be based on a compromise between precision, ease of use, and relevance for the learning task investigated.

The third level concerns the limitations at the software level. The CPR Tutor and the LearningHub have been programmed using the C# programming language and run on Microsoft Windows 10 machines. The reason for such a choice was to make the best use of Microsoft devices like the Kinect. The VIT has been developed in JavaScript and HTML 5 but tested primarily with the Google Chrome browser. SharpFlow has been developed using Python 3.7. These choices could compromise the portability of the software components to different operating systems, browsers, or platforms.

The fourth level of limitation is at the data level. As mentioned earlier, the precision and quality of the sensor devices can influence the quality of the data gathered. However, the data limitations also lie in the choice of the sample size and the diversity of the participants. Participants can have different body sizes, ways of approaching the task, and physiological responses. We call this the inter-subject variability among the participants. This variability can be mitigated by training a model with a diverse population, so that the model can generalize across their behavioral characteristics. There is, however, always the risk that such a general model washes out individual peculiarities. As an alternative, it is possible to train one classifier for each participant. The drawback of this approach is that each model will be suitable only for one person and not generalizable to new participants.

Finally, some limitations stand at the model level. There are several limitations to using the supervised machine learning approach. In CPR, the more CCs collected, the more robust and general the neural networks that can be trained for mistake classification. Such an approach is optimal when a high number of annotated training samples is available. Similarly, to set a clear dividing line between correct and incorrect learning performance, the learning task must have clear performance indicators. For example, in CPR, the compression rate needs to be between 100 and 120 compressions per minute to be optimal. The drawbacks of supervised learning are well known to the machine learning research community. There are alternative ways that can be explored to reduce the number of annotated samples needed, such as unsupervised learning, one-shot learning, or transfer-learning techniques. Concerning the use of Recurrent Neural Networks, aside from the amount of training data, the other standard limitation is the tendency to overfit the training set. Besides dividing the collected data set into training, test, and validation sets and performing cross-validation at the level of individual training samples, it is essential to do so at the subject level as well. For example, it would be helpful to hold one participant out, to make sure that the data of one or more participants are entirely new and unseen by the model.
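The hold-one-participant-out strategy can be sketched as a simple generator over the data grouped by participant; the function name and data layout below are hypothetical illustrations of the idea, not the chapter's actual implementation.

```python
from itertools import chain

def hold_one_participant_out(samples_by_participant):
    """Yield (held_out, train, test) splits where each fold holds out all
    samples of one participant, so the model is always evaluated on an
    entirely unseen subject, mitigating inter-subject data leakage."""
    participants = sorted(samples_by_participant)
    for held_out in participants:
        test = samples_by_participant[held_out]
        # Training data is every other participant's samples, concatenated.
        train = list(chain.from_iterable(
            samples_by_participant[p] for p in participants if p != held_out))
        yield held_out, train, test
```

Averaging the classification scores over these folds estimates how the model would perform on a genuinely new learner, which sample-level cross-validation alone cannot guarantee.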


The limitations discussed in this chapter can be seen as a research plan for future implementations of the Multimodal Tutor. Future research endeavors should go in both the theoretical and the technical directions. From the theoretical standpoint, as evidenced by the literature survey in (Di Mitri et al., 2018a), future work on the Multimodal Tutor should also look into empirical studies and meta-analyses to identify the most suitable data representation for each modality and propose guidelines for efficient modality combination. It would also be helpful to know the best match between each modality and the commercially available sensors, providing guidelines for the analysis of multimodal data sets.

The Multimodal Tutor “of the future” and the Multimodal Pipeline will improve and evolve as concepts to accommodate more reference application scenarios. For instance, one aspect deliberately left out both from the theoretical and from the application side is the social dimension of learning: the extent to which the teacher and the learning peers influence each other in a social context. During collaborative learning or physical classroom activities, for example, social learning is of paramount importance. We envision the implementation of the Multimodal Tutor in the Classroom of the Future. Along the line of experimentation proposed by the EduSense prototype (Ahuja et al., 2019), the Classroom of the Future will embed a run-time framework that controls different sensors, for example, installed in laptops, chairs, or desks, and connects to various actuators such as the projector, the smart board, or the lights. The purpose is to automatically orchestrate learning activities in the classroom. For this purpose, a renewed conceptualization of the Multimodal Pipeline as a framework that runs continuously at run-time is needed (Schneider, Di Mitri, Drachsler, & Specht, 2019). Learners and teachers could profit from such a system; for example, it could identify students at risk. Along this line, the system Lumilo provides an inspiring example of real-time teaching support using augmented reality by identifying and signaling at-risk students to teachers with the help of “virtual hands” (Holstein, McLaren, & Aleven, 2018).

From the technical point of view, future implementations of the Multimodal Tutor can move away from collecting short and high-frequency data sessions toward more extended data collection periods, which can last days or weeks. In our vision, the Multimodal Tutor can become a learning companion that supports the learner throughout the entire duration of a course until the target skill is mastered. For this reason, we imagine that future personalized learning technologies like the Multimodal Tutor can be on-demand, wherever and whenever the learner needs them. The functionalities of the Multimodal Tutor should be embedded in personal devices such as smartphones or smartwatches, which are at the learner’s fingertips. To become entirely ubiquitous, the Multimodal Tutor needs to better leverage cloud-based technologies. In that case, the learner would need only a device and an Internet connection to use the functionalities of the Multimodal Tutor for learning support. Given the significant amount of data gathered from the sensors, sending the complete streams to the cloud might overload the network infrastructure. An alternative to cloud computing that should be explored is fog computing, in which only relevant data or decisions are sent to the online server.

Future research on the Multimodal Tutor should look at improving the user experience from the learner's perspective. As argued in this book chapter, self-reports, questionnaires, and user ratings are essential for collecting the learning labels necessary to annotate the multimodal experiences and allow the system to learn from historical data. Repeatedly asking the learner to answer a questionnaire or to submit a report can nevertheless become a tiring task. Strategies have to be devised to maximize usability and user retention and to mature the Multimodal Tutor from a research prototype into a productivity tool.

Another paramount issue connected to the user experience is ensuring user privacy when collecting high-frequency and highly personal multimodal data. Future Multimodal Tutor applications need to be designed with better privacy features. For instance, they need to implement multiple privacy layers, consisting of features such as end-to-end encryption, authentication, or distributed data saving. The Multimodal Tutor should connect to and build on the concept of Trusted Learning Analytics (Drachsler & Greller, 2016). The learner has to become the ultimate authority over the data and the algorithms. Rather than judging and punishing the learner, the technology embedded in the Multimodal Tutor should ultimately support and improve learning, which is a fundamental part of human nature.