1 Introduction

The socio-cognitive theory posits that learning is an active constructive process in which individuals intentionally seek and process information (Bandura 2001; Pintrich 2000). From this perspective, learning involves the interaction of cognitive, motivational, and emotional processes situated in a learning context (Zimmerman and Schunk 2011). Such processes during learning are not confined to a single individual learner. Learners generally learn in a social context that involves interaction with their peers, teachers, and even parents. Learners are not only responsible for their own cognition, motivation, and emotion, but are also collaboratively responsible for the thoughts, feelings, and actions of others (Hadwin et al. 2017).

Furthermore, learners are active agents in social and technology-mediated settings, interacting with their teachers and peers, various technologies, and numerous artefacts which are available to them, often in collaborative learning environments (Azevedo et al. 2011). Thus, the quality of learning depends on the complex relationships between the cognitive, motivational, and emotional processes and the external sources surrounding students, such as teachers, peers, and/or technological tools (Bandura 2001; Miyake and Kirschner 2014). The complexity and reciprocity of learning processes and social factors during learning form a major challenge for the learning sciences in their quest to understand these processes and to find effective and efficient ways to facilitate successful learning. One way to unearth the complexity of learning processes might be to approach the learning phenomena from a multimodal perspective.

1.1 Multimodality of learning

Multimodality refers to the forms of communication and meaning-making that go beyond spoken or written language (Scollon and Scollon 2009). It includes speech, writing, and “visual, aural, embodied and spatial aspects of interaction and environments” (Jewitt 2013, p. 250). In this sense, scholars refer to learning as a multimodal activity (Ochoa et al. 2016). When learners make meaning alone or together in a learning context, they speak, write, draw figures, use facial expressions, move their bodies to represent and communicate about their meaning, manipulate objects, and make use of these multiple modalities concurrently (Magana et al. 2019; Morales et al. 2003).

For example, learners working together in teams ultimately construct their understanding about different phenomena by discussing and negotiating the meanings they have gathered both from available learning resources and each other through different sensory and communicative modalities (Riquelme et al. 2019; Anastopoulou et al. 2011; Kress 2003). In this regard, multimodal research, which studies individuals’ learning endeavours by gathering and exploring different multimodal data, might help to explain the complex interplay of cognitive, motivational, and emotional processes during the learning process.

1.2 Different forms of multimodal data

The multimodal data gathered from learning settings can be either subjective or objective. Subjective data tells about the traits and aptitudes of learners, their perceptions about a specific learning activity (e.g. Khalifeh et al. 2020), or descriptions about their mental states during learning (e.g. Bannert et al. 2014). Self-reported questionnaires, learning diaries, and audio-video coding of learners’ mental activities are common examples of subjective data (e.g. Farrokhnia et al. 2019). Although self-reported data might reveal learners’ stated intentions to learn and their beliefs about themselves as learners, such perceptions often do not match with what is actually happening during learning processes (see Noroozi et al. 2018; Winne 2004; Zimmerman 2008). Furthermore, subjective coding of observation data could also be tainted by the coders’ interpretations of observed behaviours (see Zimmerman 2008).

Objective data informs about the observable representations of cognitive and affective events that learners actually perform during learning (Winne 2010). Log traces in digital learning environments, physiological reactions during learning such as heart rate variability, skin conductance, and eye gaze can be considered as objective data types. Several educational researchers have used objective data to infer various cognitive and affective states such as cognitive load (Cranford et al. 2014; Haapalainen et al. 2010), and emotions during learning (Chanel and Mühl 2015; D’Mello 2013; Fairclough et al. 2005).

Although objective data types alone can capture cognitive or affective states of learners in various learning situations (D’Mello et al. 2017), they have to be contextualized in order to relate them to the learning processes. Such data can be combined with observational data such as audio and video recordings of learning situations (Malmberg et al. 2017) to reveal sequential and temporal dynamics of learners’ regulatory processes (i.e., planning, enacting strategies, reflection, adaptation), which indicate how previous small-scale situated adaptations in terms of regulation of situated challenges contribute to large-scale adaptation (Hadwin et al. 2017). Such a combination can significantly extend our current knowledge on the sequential and temporal nature of the complex learning processes (Azevedo et al. 2011; Winne and Hadwin 2013).

However, capturing and analysing multimodal data in learning contexts, along with their facets, is not a straightforward process and mostly requires both input tools and analytical tools that are sensitive to both the variability between and the complexity of different data modes (Di Mitri et al. 2019; Flewitt et al. 2009). Thus, it is necessary to use various technologies and tools to gather dedicated multimodal data and then analyse it in sophisticated ways, to better understand the complexity of learning in all of its nuances and intricacies.

1.3 Technologies for gathering multimodal data

Digital technologies and advanced educational tools can provide researchers with the possibility to combine subjective and objective data, to trace various cognitive and affective learning processes of the learners, to make micro-level environmental interactions and their responses of the body and brain visible (see Reimann et al. 2014), and to analyse different multimodal data (see Noroozi et al. 2019). In this regard, technologies offer many different opportunities or affordances to both sense and facilitate multimodality in learning (Azevedo and Gašević 2019; Drysdale et al. 2013).

These affordances can be: audible (we can hear sounds in digital environments), visible (we can see objects or people in digital environments), tangible (we can touch or click objects in digital environments), presence-related (we can share the same space with others and/or sense each other’s presence (i.e., social presence) in digital environments), temporal (we can be present at the same time with others in digital environments), reviewable (we can access the messages in digital environments again and again; that is there is a tangible history), and revisable (we can repeatedly update the messages in digital environments) (Kraut et al. 2002). In digital environments, learners typically generate a great amount of cognitive, metacognitive, motivational, and emotional data on what is attended to and studied, in what order this occurs, how much time was spent on what, at what times certain actions occurred, at which places in the study environment, and so forth (Azevedo et al. 2017a, b).

1.4 Research purpose

There is a growing interest in the learning sciences to take advantage of multimodal technologies and techniques to understand the complex relationships within and between individuals and their accompanying motivational and emotional reactions during learning (Blikstein 2013; Jeong et al. 2014; Martin and Sherin 2013; Ochoa et al. 2016). However, to our knowledge, no systematic review has been conducted to provide an overview of the affordances of multimodal data in terms of understanding the cognitive, motivational, and emotional processes during learning. It is not clear what and how data modalities are used to capture cognitive, motivational, and emotional learning processes. As a result, in this paper, we provide an overview of multimodal studies on learning and its underlying processes in order to better understand how to optimally use and combine data modalities when investigating various aspects of learning processes in educational settings.

We review target journal, country of the conducted study, covered subject, participant characteristics, educational level, foci (i.e., cognition, motivation, and/or emotion), type of data modality, research method, type of learning, learning setting, and modalities used to study the different foci (modality-focus). We summarise, analyse, and interpret a comprehensive set of data modalities that are used in combination for measuring various aspects of cognitive, motivational, and emotional learning processes. Specifically, we provide a systematic review of the literature on the type and combination of modalities that have been used for capturing different learning processes by seeking an answer to the following research questions:

  1. What is the current status of data modality studies to investigate learning processes in terms of the target journal, country of the conducted study, covered subject, participant characteristics, educational level, foci, type of data modality, research method, type of learning, learning setting, and modality-focus?

  2. What and how are data modalities used to capture cognitive, motivational, and emotional learning processes?

2 Method

A narrative analysis approach (see Noroozi et al. 2012) was used to identify current uses of multimodal data in various fields of learning research and also to address theoretical and methodological implications and avenues for further research. In such a narrative analysis, the aim is to systematically analyse and integrate the state of knowledge in the field and also to highlight areas that research has left unresolved (Van Dinther et al. 2011).

2.1 Search keywords and databases

A list of search keywords was selected based on the most important concepts of the study organised into three concept areas, namely 1) multimodality, 2) learning, and 3) learning facets such as cognition, emotion, and motivation. Upon a first, exploratory search, multimodal proved to be quite a generic adjective and caused many irrelevant studies to appear in the search results. The relevant nouns accompanying multimodal for the scope of the study were identified to be data, learning analytics, and signals. Additionally, the term triangulation was included in that concept area since some relevant papers do not use multimodal in their terminology, but they use triangulation, which implies multimodality since the triangulation approach can only be achieved by using different data modalities. Using Merriam-Webster’s Online, it was decided to include acquisition in the learning concept area since, according to the dictionary, they are closely related concepts, and learning is the acquisition of either knowledge or skill.

Once completed, the keywords within concept areas were combined with the Boolean operator OR and the three concept areas with the Boolean operator AND to arrive at the following search strings for Web of Science® and ERIC databases respectively:

  • TS = ((“multi*modal data*” OR “multi*modal learning analytics” OR “multi*modal *signal*” OR triangulat*) AND (learn* OR acqui*) AND (cognit* OR emoti* OR motivation* OR collaborat*))

  • ((all(“multi-modal data”) OR all(“multimodal data”) OR all(“multi-modal signal”) OR all(“multimodal signal”) OR all(triangulat*)) AND (all(learn*) OR all(acqui*) OR all(cognit*) OR all(emoti*) OR all(motivation*) OR all(collaborat*)))

Note that the asterisk (*) wildcard, which replaces multiple characters anywhere in a word, was used to capture all the possible words having the same stem as the keywords of interest. Thus, for example, cognit* fetches papers with the words cognitive, cognition, and/or cognitions, provided that the other conditions in the search string are met. Note also that although the search strings are quite similar (they actually define the same search), their syntax differs.
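The combination rule described above (OR within a concept area, AND between concept areas) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used for the review: the names `build_query` and `wildcard_to_regex` are hypothetical, and the regular-expression reading of the asterisk is a simplification, since each database applies its own wildcard rules.

```python
import re

# Concept areas and keywords taken from the first WoS search string above.
CONCEPT_AREAS = [
    ['"multi*modal data*"', '"multi*modal learning analytics"',
     '"multi*modal *signal*"', 'triangulat*'],
    ['learn*', 'acqui*'],
    ['cognit*', 'emoti*', 'motivation*', 'collaborat*'],
]


def build_query(concept_areas):
    """Join keywords with OR inside an area and join the areas with AND."""
    groups = ["(" + " OR ".join(area) + ")" for area in concept_areas]
    return " AND ".join(groups)


def wildcard_to_regex(keyword):
    """Translate a '*' keyword into a whole-word regex (simplified assumption)."""
    pattern = re.escape(keyword).replace(r"\*", r"\w*")
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)


query = build_query(CONCEPT_AREAS)
matcher = wildcard_to_regex("cognit*")

print(query)
print(bool(matcher.search("Measuring cognitive load")))
```

Keeping the concept areas as a list of lists makes it straightforward to add a keyword to one area without touching the others.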

An exploratory search for articles was initially conducted on the online repositories of: Education Resources Information Center (ERIC) Digital Library, Web of Science® (WoS), IEEE Xplore®, and SpringerLink®. ERIC was selected as it is the largest repository in education. A quick inspection showed that IEEE and Springer databases produced results that were not in the field of education, and thus out of the scope of the review. Moreover, SpringerLink allows exporting a maximum of 1000 results, while over 2000 results were obtained. So they were excluded from further consideration. The searches on ERIC and WoS were conducted on June 21–22, 2017, resulting in 669 and 318 hits, respectively, totalling 987.

Later, it was noticed that some authors use the term multichannel instead of multimodal, and therefore, for the sake of completeness, we included multichannel in the corresponding concept area. In addition, ScienceDirect® (Elsevier) database was also searched since it contains journals targeting research at the intersection of technical and educational aspects, and thus, with the potential to find more novel data modalities in the field.

On October 3, 2017, the searches were updated on ERIC and WoS and extended to ScienceDirect, and the new results were included for further scrutiny. The updated search yielded 332 results from WoS, while no new results were found in ERIC. Therefore, in the second round, a total of 429 results (WoS + ScienceDirect) were found.

The final keyword searches, for WoS, ERIC and ScienceDirect respectively, were:

  • TS = ((“multi*modal data*” OR “multi*modal learning analytics” OR “multi*modal *signal*” OR multi*channel* OR triangulat*) AND (learn* OR acqui*) AND (cognit* OR emoti* OR motivation* OR collaborat*))

  • ((all(“multi-modal data”) OR all(“multimodal data”) OR all(“multi-modal signal”) OR all(“multimodal signal”) OR all(multi*channel*) OR all(triangulat*)) AND (all(learn*) OR all(acqui*) OR all(cognit*) OR all(emoti*) OR all(motivation*) OR all(collaborat*)))

  • (“multi*modal data*” OR “multi*modal learning analytics” OR “multi*modal *signal*” OR multi*channel* OR triangulat*) AND (learn* OR acqui*) AND (cognit* OR emoti* OR motivation* OR collaborat*)

2.2 Additional search parameters

Using the respective database functionality, various search parameters were specified to narrow down the results to those potentially relevant for this review. The parameters allowed us to refine the document type, language, and year of publication.

To obtain scientific fidelity of the studies, only peer-reviewed publications were included. This implies that other publications such as books, book chapters, dissertations, theses, conference proceedings, and reports were not included in the analysis because of the lack of information on how the review process had been carried out with these publications. These important and relevant publications were, however, consulted in order to shape the theoretical framework of the study and to further accumulate the state of knowledge and specific issues in this field. It should be noted, though, that an explicit peer-reviewed option was only available in the ERIC database, while this was not the case in WoS and ScienceDirect. Only published English articles were included in the study since English is the lingua franca of science and the common language of the authors. To study the most recent literature in the field, the time span was limited to publications from 2000 through 2017. This study was not restricted to a single discipline of interest and thus all publications from any domain and/or discipline were included.

2.3 Identification of relevant publications

The results from both the first and second search rounds were then screened. We inspected titles, abstracts, and, when necessary, the full text of the articles and removed a number of irrelevant publications that did not meet the purpose of the study. Publications were excluded from further analysis if they did not: 1) include evidence related to the learning sciences and report at least one of the aspects of learning processes and/or outcomes (i.e., cognition, motivation, emotion); 2) use at least two modalities of data (i.e., studies that focused on only one data modality were excluded); 3) belong to a formal educational level such as primary, secondary, high school, college, or higher education (studies conducted in summer schools, second language courses, distance learning, online courses, and other extra-curricular activities were also included); or 4) report empirical findings on the topic (i.e., conceptual, methodological, and theoretical publications were excluded). Duplicate publications were also removed. Additionally, seven publications had to be removed due to the unavailability of the full text, in spite of the efforts made to contact their authors via email and/or ResearchGate®.
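The screening criteria listed above can be read as a simple filter over the search results. The following Python sketch shows one possible way to express them; the `Record` fields and the `screen` function are hypothetical names introduced for this example, and in the review itself the judgements were made by human coders reading titles, abstracts, and full texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One search hit with the coders' judgements (hypothetical structure)."""
    title: str
    reports_learning_process: bool   # criterion 1
    n_modalities: int                # criterion 2
    in_scope_setting: bool           # criterion 3
    is_empirical: bool               # criterion 4
    full_text_available: bool = True


def screen(records):
    """Drop duplicates, then keep only records meeting all criteria."""
    seen, kept = set(), []
    for record in records:
        key = record.title.strip().lower()
        if key in seen:
            continue  # duplicate publication
        seen.add(key)
        if (record.reports_learning_process
                and record.n_modalities >= 2
                and record.in_scope_setting
                and record.is_empirical
                and record.full_text_available):
            kept.append(record)
    return kept


# Made-up records for illustration only (not data from the review).
hits = [
    Record("A study", True, 3, True, True),
    Record("a study ", True, 3, True, True),          # duplicate title
    Record("Single modality", True, 1, True, True),   # fails criterion 2
    Record("Conceptual paper", True, 2, True, False),  # fails criterion 4
    Record("No full text", True, 2, True, True, False),
]
relevant = screen(hits)
```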

Although this systematic review targets empirical studies, we used conceptual and methodological publications to support the results of empirical studies with conceptual literature. Focusing only on the empirical studies could have yielded an incomplete picture of the state of the art of this topic. Therefore, both conceptual and methodological papers were used in the review but not in the analysis to produce an accurate representation of this body of knowledge under a number of research paradigms.

The identification process of relevant publications was carried out by two coders (co-authors) independently for the sake of reliability, resulting in 173 relevant publications included from the first round, and 34 from the second. The number of publications meeting the relevant criteria for the analysis was 207 papers in total.

A checkpoint for inter-rater reliability was set after the classification of the first round of reviewed publications into relevant or not relevant. At this point, Cohen’s kappa was .40. The coders met to discuss the discrepancies, after which the reliability improved to .68. When processing the second search round, the inter-rater reliability, as measured by Cohen’s kappa, was .84. Overall, Cohen’s kappa was .61. We then resolved all disagreements and reached consensus through discussion between the coders and the first author of this study.
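For readers unfamiliar with the statistic, Cohen’s kappa compares the observed agreement between two coders with the agreement expected by chance from each coder’s label frequencies. The sketch below is a minimal Python illustration of that standard formula; the function name and the screening decisions are made up for the example and are not the review’s data.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders rating the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    n = len(coder_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical screening decisions for ten publications.
coder_1 = ["relevant", "relevant", "not", "not", "relevant",
           "not", "relevant", "not", "not", "relevant"]
coder_2 = ["relevant", "not", "not", "not", "relevant",
           "not", "relevant", "relevant", "not", "relevant"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2))
```

In this made-up example the coders agree on eight of ten items while chance agreement is .50, so kappa is noticeably lower than the raw agreement rate.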

3 Results

Applying the systematic search strategy, 207 publications were deemed eligible for inclusion in this review. A complete list of publications is provided in Appendix Table 1, categorised by author(s), target journal, country of the conducted study, covered subject, participant characteristics, educational level, foci, type of data modality, research method, type of learning, learning setting, and modality-focus.

3.1 Results for research question 1

The 207 multimodal publications found in the search were distributed among 139 journals. About a quarter (25.2%) of the included publications came from journals that had two or more multimodal publications, while almost three quarters (74.8%) came from journals that contributed only one multimodal publication. The journals System (nine cases), Recall (seven), English Language Teaching (six), BMC Medical Education (six), Computers in Human Behavior (four), Computers and Education (four), International Journal of Science Education (four), and Nurse Education Today (four) were on top of the list of the publication outlets due to their vast coverage of the focal point of this research. The remaining publications were found in different journals of various fields ranging from soft sciences such as Teaching and Teacher Education, International Journal of Research in Education and Science, Education & Training to hard sciences such as School Science and Mathematics.

About 26% of the multimodal publications were published within the subject of language studies focusing on different aspects of second language acquisition. The second most common category was related to the STEM subjects (18%), such as mathematics and physics. Multimodal studies have been conducted across curricula, both in hard subjects such as mathematics, chemistry, physics, medicine, and biology and in soft subjects, namely the social sciences (e.g., humanities, psychology, economics). About 15% of the publications did not specify a discipline (see Fig. 1).

Fig. 1 Distribution of subjects in the studies included in the review expressed in percentages

The number of participants reported ranged from one to 1384 (M = 81.00; SD = 134.40), while 17 publications did not specify the number of participants. Multimodal research is not restricted to any one continent; studies have been conducted across all continents. The majority of multimodal research studies have been conducted in the USA (45 publications) and the UK (18). This is followed by countries such as Taiwan (12), Japan (11), Turkey (nine), Australia (eight), Malaysia (eight), and Hong Kong (eight). Only eight multimodal research studies reported results from at least two countries, which stresses the need for a more multicultural dimension of this field of research.

The educational context of the studies varied. The majority of multimodal research studies (46%) were conducted in a university setting with undergraduate students as the target group. The other popular target group for multimodality research was pupils in primary education (11%) followed by secondary education (7%), high school (4%), graduate-level (3%), and early childhood education (1%). No study was conducted in vocational education. Data from at least two educational levels were reported in 5% of the publications, while 22% of the studies did not specify the educational level at which data were collected (see Fig. 2).

Fig. 2 Educational context of the empirical studies included in the review by number of studies

The majority of multimodal research studies (116 publications) used mixed methods to analyse various aspects of cognitive, motivational, and emotional learning processes; only eight studies exclusively used qualitative methods (e.g., interviews and observations) and 83 exclusively used quantitative methods (e.g., students’ products, surveys and performance tests).

When it comes to the type of learning, the findings showed that most of the multimodal publications (84%) focused on individual learning, 11% focused on group learning, and only about 5% of the publications on both individual and group learning. The review shows that multimodal researchers have been mostly investigating individual learning, followed by dyads, triads, small and large groups. For publications on group learning, the group size varied both within and between publications. The minimum size of the learning groups was dyads of learners, and the maximum size of the learning groups was between 20 and 30 members. The group size was fixed for about a quarter (26.3%) of the group learning publications, while in the vast majority of the group learning studies, the group size was not fixed (e.g., 3–4, 3–5, 7–8).

Of the 207 publications, 55% (113) were conducted in a regular (on-campus) setting. About 35% (59 publications) were conducted in courses offered in online learning programs. Only three publications (1%) were related to extra-curricular activities outside the official academic setting. About 10% of publications were conducted in a mixed online and regular setting, while about 12% of studies were categorised within a mixed online and extra-curricular setting. The remaining 2% of studies did not report the learning setting of their study.

3.2 Results for research question 2

Figure 3 displays the distribution of participants per modality. In total, 39,812 participants were observed in the 190 publications that reported on their participants. The survey modality captured the largest number of participants (12,006), followed by interview (9117), observation (6740), and performance measure (4587). Heart rate variability, as an objective modality, captured the smallest number, with only six participants.

Fig. 3 Distribution of participants per modality

In total, 18 types of modalities were classified in this review. For the 207 multimodal publications, 721 occurrences of modalities were observed. The average number of modalities per publication was 3.48. The maximum number of modalities was seven, and the minimum, as set by the inclusion criteria, was two. Interview, with 182 occurrences, was the most frequently used modality for data collection. Survey (168 occurrences), observation (135), student product (119), and performance measure (47) were the next most frequently used methods of multimodal data collection. The least common modalities were heart rate variability (one occurrence), facial expression recognition (two), and screen recording (four). Figure 4 depicts the type and number of modalities in the reviewed publications.
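The tallies in this paragraph follow from a simple count over the coded publications. The Python sketch below shows how such counts could be derived; the three coding records are hypothetical stand-ins for the entries in Appendix Table 1, and only the totals 721 and 207 come from the text above.

```python
from collections import Counter

# Hypothetical coding records: one entry per publication, listing its modalities.
publications = {
    "study_a": ["interview", "survey"],
    "study_b": ["interview", "observation", "student product"],
    "study_c": ["survey", "performance measure", "interview", "observation"],
}

# Occurrences of each modality across all publications.
occurrences = Counter(
    modality for modalities in publications.values() for modality in modalities
)
total_occurrences = sum(occurrences.values())
mean_per_publication = total_occurrences / len(publications)

# The same arithmetic applied to the totals reported above.
reported_mean = round(721 / 207, 2)

print(occurrences.most_common(2))
print(mean_per_publication, reported_mean)
```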

Fig. 4 Modalities used in the empirical studies; the numbers represent the number of studies in which they were used

Of the 207 multimodal publications, 98 focused exclusively on the cognitive aspect of learning, followed by 27 that only focused on motivation, while only five papers exclusively focused on the emotional aspects of learning. The remaining publications addressed at least two combined aspects of learning. The most frequently studied paired focus, with 46 publications, was related to the cognitive and motivational aspects of learning. Only 14 publications addressed the cognitive and emotional aspects of learning at the same time. This was followed by seven multimodal publications that studied motivational and emotional learning processes at the same time. Only ten publications focused on cognitive, emotional, and motivational aspects of learning at the same time.

The focus of each publication in terms of data modalities was paired with different aspects of learning. Out of 721 occurrences of modalities in the reviewed publications, 437 occurrences focused on measuring the cognitive aspect of learning, followed by 203 related to motivation. Only 81 occurrences of modalities were allocated for studying the emotional aspect of learning. Interview, survey, observation, and student product as the most popular data modalities were mostly used to measure cognitive aspects of the learning, followed by motivational and emotional aspects (see Fig. 5 for distribution of the modalities in terms of different foci).
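The counts reported in the two preceding paragraphs can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the figures given in the text; the dictionary keys are labels chosen for this example.

```python
# Number of publications per focus or focus combination, as reported above.
publications_by_focus = {
    "cognition only": 98,
    "motivation only": 27,
    "emotion only": 5,
    "cognition + motivation": 46,
    "cognition + emotion": 14,
    "motivation + emotion": 7,
    "cognition + motivation + emotion": 10,
}

# Number of modality occurrences per focus, as reported above.
occurrences_by_focus = {"cognition": 437, "motivation": 203, "emotion": 81}

total_publications = sum(publications_by_focus.values())
total_occurrences = sum(occurrences_by_focus.values())

# Share of modality occurrences devoted to each focus, in percent.
shares = {
    focus: round(100 * count / total_occurrences, 1)
    for focus, count in occurrences_by_focus.items()
}

print(total_publications, total_occurrences)
print(shares)
```

Both totals match the 207 publications and 721 modality occurrences stated in the text.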

Fig. 5 Distribution of the modalities in terms of different foci expressed in percentages of total number of studies using that modality

The focus of each publication in terms of data modalities was paired with the type of method (i.e., quantitative, qualitative, and mixed) of the multimodal publications. Observation was the main modality used in qualitative studies, while interviews and surveys were the most frequently used methods in quantitative studies. Interview, survey, and performance methods were used in mixed studies (both qualitative and quantitative).

Each publication in terms of the foci of the paper was paired with the type of method of multimodal publications. Qualitative methods were popular to capture cognitive aspects of the learning process, quantitative methods were used for both cognitive and motivational aspects of learning, while mixed-methods captured the combination of cognitive, motivational, and emotional aspects of learning.

The focus of each publication in terms of data modalities was paired with the educational level (ranging from early childhood education to university graduate students) of the multimodal publications. Modalities such as interviews, observation, and performance tests were the most typical type of data collection for lower levels of education (e.g., early childhood, primary, and secondary school). Modalities such as interviews, surveys, observation, and student’s product such as reflection reports were the most typical type of data collection for higher levels of education (e.g., high school, undergraduate and graduate university students).

Each publication in terms of the foci of the paper was paired with the educational level of the multimodal publications. No distinctive pattern was found in the foci of the publications for the different levels of education. However, in lower levels of education, the focus was on researching cognitive and motivational aspects of learning, while researchers studied combined cognitive, motivational and emotional aspects of learners mostly in higher levels of education.

The focus of each publication in terms of data modalities was paired with the type of learning setting (i.e., regular, online, and extra-curricular) of the multimodal publications. Modalities such as interviews, surveys, and student products were the most typical types of data collection for the regular learning setting. Modalities such as surveys, observation, and performance tests were the most typical type of data collection for the online learning setting. Modalities such as interviews and observation were the most typical type of data collection for the extra-curricular learning setting in the reviewed publications.

Each publication in terms of the foci of the paper was paired with the type of learning setting (i.e., regular, online, and extra-curricular) of the publications. Cognition was the most frequent focus of the publications for the regular learning setting. In online learning settings, cognition was mostly measured along with either motivational or emotional aspects of learning. No distinctive pattern was found in the foci of the publications for the extra-curricular learning setting. Again, cognitive and motivational aspects of learning in combination were the most frequent focus of the papers for the extra-curricular learning setting.

The focus of each publication in terms of data modalities was paired with the type of learning (i.e., individual, group, and mixed) of the publications. Modalities such as surveys, interviews, and observation were the most typical type of data collection for individual learning. Modalities such as surveys, interviews, and student products were the most typical type of data collection for the online learning setting. Observation was the most typical modality for the extra-curricular learning setting in the reviewed publications.

Each publication in terms of the foci of the paper was paired with the type of learning of the publications. Cognition was the most frequent focus of publications in all types of learning. In both individual and group learning settings, emotion was the least studied aspect of learning in the reviewed publications. No distinctive pattern was found in the foci of the multimodal publications when the mixed learning setting was used.

4 Discussion

In this systematic review, we aimed to provide an overall picture of the utilisation of multimodal data in learning research. The current review yielded 207 multimodal papers that used more than one data modality to investigate various learning processes. In the following subsections, we elaborate on and synthesise the findings around the research questions.

4.1 Characteristics of multimodal data studies

The findings revealed that the majority of published papers are dispersed across a wide spectrum of different journals ranging from language learning to medical education to educational technology fields. These findings underline the widely distributed and scattered nature of multimodal research. With respect to the methodological issues involved in carrying out multimodal research, there is a need for a multimodal data publication outlet dealing with these issues. Such a journal might help to understand the methodological and analytical skills needed to deal with multimodal data. In addition, such a multimodal journal might act as a venue to develop standard procedures and tools for processing and combining different data modalities in learning research.

The reviewed publications represent all major regions of the world, although North America produced the most. Few studies combined samples of participants from different countries. It seems that the multicultural aspects of learning are neglected in research that makes use of multimodal data. This deserves attention since past studies revealed that there might be cultural differences, specifically in interpreting emotional cues and in motivations (Dekker and Fischer 2008; Eid and Diener 2001; Ekman et al. 1987). For example, Yuki et al. (2007) found that Japanese people focus on the position of the eyes when interpreting emotional expressions, whereas Americans tend to focus on the position of the mouth. Masuda et al. (2008) further found that Japanese people pay attention to the social context (i.e., surrounding individuals’ emotions) when interpreting one’s emotions, whereas Westerners pay less attention to the social context and focus more on the person of interest. Further, a meta-analysis by Dekker and Fischer (2008) revealed significant differences between societies in terms of academic achievement motivations. Such findings indicate that the motivational and emotional aspects of learning might vary across cultural contexts. This is particularly important due to the internationalisation of education. In developed countries, most educational institutions are melting pots of different cultures. Thus, multimodal data collection from multiple cultures might be particularly necessary when investigating the motivational and emotional aspects of learning.

The majority of the multimodal studies were conducted with university students. Further, none of the reviewed publications was conducted in vocational or workplace learning settings. These findings point to the need to widen the sample scope of multimodal educational research from college settings to other educational institutions for more generalisable inferences. Further, it is known that with the increased use of digital technologies in classrooms, learning in K-12 settings has become more multimodal (Ryan et al. 2010). Thus, collecting multimodal data from natural classroom settings at lower education levels (e.g., primary and secondary) might open new paths to explore the experience of K-12 teaching and learning in the digital era. Further, the study showed that health sciences education has been a prominent field that benefited from multimodal data studies. Considering that health sciences are highly focused on skill acquisition through deliberate practice (McGaghie et al. 2014), our findings underline the importance of multimodal data in developing the practical and procedural skills of individuals. In this regard, utilising multimodal data in vocational education might also be a promising approach to developing the procedural and practical skills of the future blue-collar workforce.

Multimodal data analysis requires the alignment of different analytical methods to process the data coming from different channels. In this regard, multimodal data analysis can also be described as a multi-method approach. The findings of this study support such a conclusion. Our results revealed that the majority of the multimodal research papers used mixed methods and combined quantitative and qualitative methods to derive inferences from the multimodal data.

The findings showed that studies utilising multimodal data in collaborative learning settings constituted only a small portion of the publications, whereas the majority of the studies focused on individual learning. Indeed, collaborative learning is a more complex phenomenon to investigate than individual learning (Hadwin et al. 2011). This is because multiple agents, each with their own goals, plans, and strategies, participate concurrently in the group learning activity and influence each other. When learning collaboratively, students should develop a common understanding of, and common goals for, the learning activity (Dillenbourg 1999; Stahl et al. 2006). Further, they should effectively coordinate their own and other team members’ efforts to reach the group’s learning goals (Fransen et al. 2013). Thus, a single data channel might fall short of capturing how interactions unfold over time in a collaborative learning setting. In this regard, it is hoped that the use of multimodal data in collaborative learning research will be more prevalent in the future.

In terms of the learning settings, around half of the studies were conducted in regular school environments. Around one-third of the publications were conducted in online learning environments. One specific affordance of online learning environments is that they might allow researchers to trace learner activities with log data. Nevertheless, our findings showed that few multimodal studies (four papers) in online learning environments used log data. This might be due to two reasons. First, researchers might have regarded online environments solely as a learning medium rather than also as a data channel. Second, the online learning environments used for the learning activity often might not have allowed researchers to collect log data. Considering the first reason, we suggest that researchers take advantage of online learning environments for collecting learning traces. A unique attribute of log data collection is that it is unobtrusive and takes place in real time during learning (Winne 2017). Thus, with log data, it is possible to follow learning events at micro levels without interrupting the learners (see Malmberg et al. 2013). In terms of the second reason, researchers might consider using online environments that facilitate log data collection. Many of today’s online learning management systems facilitate time-stamped tracing of learner activities such as resources accessed, assignments completed, discussions attended, and information exchanged with others (Winne 2017).
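
To make the idea of combining time-stamped log traces with a second data channel more concrete, the following is a minimal sketch in Python. It is an illustration only and was not used in any of the reviewed publications. The function name `nearest_sample`, the tuple layouts, and the two-second tolerance are all assumptions introduced here. The sketch pairs each logged learner action with the closest sample (in time) from another stream, such as a self-report prompt or a physiological reading.

```python
from bisect import bisect_left


def nearest_sample(log_events, samples, tolerance=2.0):
    """Pair each log event with the closest sample in time.

    log_events: list of (timestamp_seconds, action) tuples (hypothetical layout)
    samples:    list of (timestamp_seconds, value) tuples, sorted by timestamp
    tolerance:  largest accepted time gap in seconds (an assumed default);
                events with no sample inside the gap are paired with None
    """
    times = [t for t, _ in samples]
    aligned = []
    for t, action in log_events:
        i = bisect_left(times, t)
        # Only the sample just before and the one just after can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - t), default=None)
        if best is not None and abs(times[best] - t) <= tolerance:
            aligned.append((t, action, samples[best][1]))
        else:
            aligned.append((t, action, None))
    return aligned


# Hypothetical data: three logged actions and three samples from another channel.
log = [(10.0, "open_resource"), (25.0, "submit_assignment"), (100.0, "post_message")]
stream = [(9.0, 0.4), (11.5, 0.6), (24.0, 0.9)]
aligned = nearest_sample(log, stream)
```

Nearest-neighbour matching with a tolerance is only one possible alignment choice. Studies with continuous signals might instead aggregate the signal over a window around each logged event.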

4.2 What and how are data modalities used to capture cognitive, motivational, and emotional learning processes?

Subjective data (e.g., interviews, self-reports, and observations) were the most prevalent data types in multimodal learning studies. Our findings further showed that subjective data modalities were the most frequent data types used to research all aspects of learning. The use of objective measures such as heart rate variability (one case), facial expression recognition (two cases), screen recordings (four cases), and eye-tracking (five cases) was limited. Heart rate variability was used to complement self-reports, observations, and interview data to understand the anxiety of English language learners. Facial expression recognition was combined with observations, screen recordings, and log data to infer the emotional states (e.g., boredom, engagement) of learners in intelligent tutoring systems. Screen recordings were matched with observations, interviews, and student artefacts to investigate internet reading strategies and creative thinking strategies, or were combined with log data to investigate text search strategies for writing. In one study, screen recordings were also used to investigate how the facial emotions of learners react to the interaction with a virtual tutor. Interestingly, several other objective data types (e.g., electrodermal activity, blood volume pulse, electroencephalogram, temperature, and accelerometer) were not found.

Overall, the existing findings highlight that the use of objective data modalities in learning research is still in its infancy. In addition, although the use of physiological measures seems to be on the rise (see Pijeira-Díaz et al. 2016), it seems that they are mostly used alone and not in combination with other data modalities. This is unfortunate because objective measures offer various new avenues for learning research. For example, it is known that physiological signals inform on specific cognitive or emotional challenges during learning (Henriques et al. 2013). In this regard, objective data can be combined with subjective data to explain the sequence of specific micro-level processes that result in particular perceptions, feelings, and other learning-related outcomes. Further, physiological data open new paths to explore social interactions in unique ways. For example, several measures have been developed to quantify the physiological coupling (e.g., in heart rate variability or electrodermal activity) between interacting individuals (Chanel and Mühl 2015; Palumbo et al. 2017). These measures allow researchers to investigate how the physiological coupling between individuals relates to several interaction features and group performance (Chanel et al. 2012; Henning et al. 2001). The underuse of some objective data in educational settings might also be due to practical limitations. That is, measuring physiological signals in natural classroom settings is more challenging than in laboratory settings. However, as more devices (e.g., smartwatches) are becoming available to measure physiological processes with low intrusiveness (Liao et al. 2012), it can be foreseen that the use of objective measures in regular classroom settings will be less challenging in the near future.
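
As an illustration of what a physiological coupling measure can look like, the sketch below computes a windowed Pearson correlation between two equally sampled signals from two learners. This is only one simple index chosen here for illustration. It is not claimed to be the measure used in the studies cited above, and the function names, window length, and step size are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is flat."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant signal
    return cov / sqrt(var_x * var_y)


def windowed_coupling(signal_a, signal_b, window, step):
    """Correlate two signals window by window (hypothetical coupling index).

    signal_a, signal_b: equally sampled readings from two learners
    window:             number of samples per window (assumed parameter)
    step:               number of samples between window starts (assumed parameter)
    """
    if len(signal_a) != len(signal_b):
        raise ValueError("signals must have the same length")
    return [
        pearson(signal_a[s:s + window], signal_b[s:s + window])
        for s in range(0, len(signal_a) - window + 1, step)
    ]


# Hypothetical readings: learner B rises with learner A, learner C moves oppositely.
learner_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
learner_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
learner_c = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
same_direction = windowed_coupling(learner_a, learner_b, window=3, step=3)
opposite_direction = windowed_coupling(learner_a, learner_c, window=3, step=3)
```

Values near 1 in a window indicate that the two signals rose and fell together during that period, and values near -1 indicate opposite movement. Real signals would first need artefact removal and resampling to a common rate, which this sketch leaves out.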

The most investigated aspect of the learning process in multimodal studies was cognition. A significant number of publications also focused on motivational aspects of learning. The number of publications investigating emotional aspects of learning was quite low compared with the cognitive or motivational aspects. The under-exploration of emotions in multimodal learning research is also reflected in the limited usage of certain data modalities. For example, facial expression recognition, electrodermal activity, heart rate variability, blood volume pulse, and body temperature were rarely used, although they can be indicative of emotional states in the human mind and body (Henriques et al. 2013). Utilising physiological measures in future learning research might open new paths to increase our knowledge, particularly on the emotional aspects of learning.

More than half of the reviewed publications focused solely on a single aspect of learning, and approximately one third focused on two. According to the general understanding, cognitive, motivational, and emotional processes interact with each other during learning (Zimmerman and Schunk 2011). Thus, learning research should focus on how such interaction among different learning processes unfolds during learning rather than exclusively focusing on the evolution of one single process. Nevertheless, the current results show that cognitive, motivational, and emotional processes have so far been researched in isolation or in pairs rather than all three together. Future research should put more effort into using separate data streams to measure different aspects of learning at the same time, also in collaborative learning settings.

Our findings further indicate that data triangulation in multimodal research has been mostly done with subjective data modalities. It seems that combining self-reports with interviews or observations has been the mainstream triangulation approach. The aim of data triangulation is to provide a comprehensive and multi-perspective understanding of the phenomenon investigated (Boyd 2000). In this regard, complementing subjective with objective modalities would be a better approach to derive less biased inferences and innovative understanding from learning data compared with the triangulation of subjective-only modalities. Multimodal approaches in learning research can help to tackle the constraints of typical single-channel data (e.g., subjective, objective, or physiological data), and help to draw more valid and reliable inferences about the learning processes (Harley et al. 2015; Pantic and Rothkrantz 2003).

Overall, the findings indicate that cognitive aspects of learning have been studied extensively compared with the motivational or emotional aspects of learning. It was also observed that objective data modalities that can specifically tap into emotional aspects of learning have been largely ignored. Therefore, future research should make use of objective data modalities to extend current knowledge on the emotional aspects of learning. Finally, cognitive, motivational, and emotional aspects of learning have been mostly investigated alone rather than in combination. Future research might utilise different modalities for measuring different aspects of learning to investigate how those aspects are intertwined with each other during learning.

5 Conclusion, limitations and future work

Our findings led to the conclusion that multimodal data is a vast area of research across various learning domains. Comparing the use of multimodal data in various domains is worth investigating in future studies. This would help to understand the affordances and limitations of multimodal data across the whole spectrum of learning domains. Further, multimodal research mostly focused on individual learning. In future studies, collecting multimodal data from collaborative learning settings might help to capture the complex social, motivational, and emotional processes that arise during collaborative learning. Although multimodal data were mostly gathered from on-campus settings, a significant portion of studies collected multimodal data from online or blended learning settings. This highlights the future potential of multimodal data in learning. That is, tracing learning activities with log data and combining those traces with physiological or subjective data might provide new insights into learning in online environments. This study illustrates that multimodal research has mainly benefited from conventional data types such as self-reports, interviews, and observations. Our results call for including objective data modalities in learning research as well. In particular, the affordances of physiological data for increasing the relatively low number of publications on the emotional aspects of learning are worth exploring. Further, rather than researching cognitive, motivational, and emotional aspects of learning separately, we encourage scholars to tap into multiple learning processes with multimodal data to derive a more comprehensive view of the phenomenon of learning. In this case, the use of advanced educational technologies and tools is recommended, especially those tools that facilitate multimodality in learning.

This study considered only papers that study learning processes with multimodal data and that use the terms “multimodal” or “multichannel”. However, it must be acknowledged that some publications could have followed a multimodal approach without explicitly using those terms. The current study also bears the typical limitations of systematic reviews. It is possible that the search strategies employed, or the databases searched, did not capture all the publications that are relevant to our research aims. In addition, the review is limited to studies published in English. Future studies can extend the scope of the findings by including non-English publications and by searching for terms that might represent multimodality in a wider context than the current study.