In this section we will discuss the techniques employed in the present work. First, in Sect. 3.1 we describe the AR prototype, and we introduce the physical concept of torque. Second, in Sect. 3.2 we discuss a think-aloud session aimed at identifying problems with visualization and usability. Finally, in Sect. 3.3 we explain the approach adopted to measure the learning enhancement through the use of the AR app. In particular, we describe the test design and we discuss the statistical analysis and the scoring system used to assess the performance improvement.
Prototype
An AR application was developed as a prototype and evaluated to see whether learning the concept of torque could be enhanced by means of visualization. The prototype was developed using the cross-platform game engine Unity together with Vuforia, a software-development kit that enables the use of AR in mobile applications. Development followed an iterative process in which the application was continuously evaluated by the three authors. When an alpha version was ready, a think-aloud user study focusing on usability and visualization was performed (see Sect. 3.2 for details). The gathered data were taken into consideration when finishing the beta version of the app. In the application, torque is visualized through two examples (see Fig. 1): a wrench tightening a bolt, to visualize torque with respect to a point, and a door opening and closing, to visualize torque with respect to an axis. Let us recall that the torque \(\overrightarrow{M}_\mathrm{O}\) of a force \(\overrightarrow{F}\) applied at point A, with respect to another point O, is calculated as:
$$\begin{aligned} \overrightarrow{M}_\mathrm{O}=\overrightarrow{r}_{\mathrm{OA}} \times \overrightarrow{F}, \end{aligned}$$
(1)
where \(\overrightarrow{r}_{\mathrm{OA}}\) is the position vector (which defines the location of A with respect to O), \(\times\) denotes the vector product and \(\overrightarrow{(\cdot )}\) indicates vector quantities. The user can control the magnitude of the position vector \(\overrightarrow{r}_{\mathrm{OA}}\), as well as the magnitude and the orientation of the force vector \(\overrightarrow{F}\). When the user applies the force by pressing the “play” button, the torque vector is shown on the screen together with its projected component along the direction normal to the plane (wrench example) or parallel to the axis of rotation (door example). The corresponding object moves according to the applied torque, and the calculation of the corresponding vector product can be observed in the upper part of the screen. Figure 2 shows an example of the application being used, and a video of its usage can be found in Supplementary Material 1.
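The vector product in Eq. (1) and the projection onto an axis can be illustrated with a minimal Python sketch. This is not the app's implementation (which relies on Unity); the function names and the numerical values for the wrench and the door are hypothetical and chosen only for illustration.

```python
def cross(a, b):
    """Vector product a x b for 3-component tuples."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)


def dot(a, b):
    """Scalar product of two 3-component tuples."""
    return sum(x * y for x, y in zip(a, b))


def torque_about_point(r_oa, force):
    """Torque M_O = r_OA x F of a force applied at A, with respect to O."""
    return cross(r_oa, force)


def torque_about_axis(r_oa, force, axis):
    """Component of M_O along an axis through O (axis need not be unit length)."""
    norm = dot(axis, axis) ** 0.5
    unit = tuple(c / norm for c in axis)
    return dot(torque_about_point(r_oa, force), unit)


# Hypothetical wrench: 0.25 m handle along x, 40 N force along y.
wrench_torque = torque_about_point((0.25, 0.0, 0.0), (0.0, 40.0, 0.0))

# Hypothetical door: hinge along z; the force component parallel to the
# hinge (15 N along z) does not contribute to the torque about the axis.
door_torque = torque_about_axis((0.25, 0.0, 0.0), (0.0, 40.0, 15.0), (0.0, 0.0, 1.0))

print(wrench_torque)  # torque vector in N m, normal to the x-y plane
print(door_torque)    # scalar torque about the hinge in N m
```

In the wrench case the torque points along the normal to the plane containing the position and force vectors, whereas in the door case only the component parallel to the hinge is relevant for the rotation.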
The two examples were built from basic 3D models provided in Unity, combined with imported, pre-made models such as the wrench and bolt, and a self-made model for the arrow. The arrow was modelled using the 3D modelling program Blender. Once created, the scenes were placed on image targets, which the AR app needs in order to know where to render the modelled scene. When the Vuforia-powered app recognises a target through the phone’s camera, it renders the model placed on that specific target. To make the model move, Unity’s built-in functionality for applying, e.g., force and torque was used. Unity allows objects to be classified as rigid bodies, which makes it possible to apply real-life physics to them. Through scripting, the forces applied to the objects were regulated by controls and visualized by the arrows, and the torque was activated when the play button was pressed.
Visualization and Usability
To evaluate the usability of the AR application (also denoted as app in this study), a user study in the form of a think-aloud session was performed with a group of three graduate students. These students were considered able to provide relevant feedback since they all have a background in both interaction design and visualization studies. Having the participants think aloud continuously allows usability mishaps to be detected while a particular task is being completed. In individual sessions, the participants were first instructed to apply a torque to the displayed figure. Since this is the main function of the application, it was considered relevant to investigate how intuitively this task could be carried out. After completing the first task, the users were asked whether they could find additional information regarding the concept of torque in the app. Finally, the participants were encouraged to provide feedback on possible improvements of the usability and visualization of the app. The interviewer answered no questions and gave no help during the session. The user study was performed on an iPhone X running iOS 13.1.3. All participants had previous basic knowledge of torque, although it had been acquired many years earlier. This session resulted in qualitative data that were taken into account when finishing the development of the prototype.
Measuring Learning Enhancement
Despite the assumption that individuals learn by doing, the potential learning enhancement needs to be proved and therefore measured. Kirkwood and Price (2014) stated that many studies discuss enhancement without defining exactly in which sense, and thus in many cases the approach adopted to measure any improvement is not appropriate. They emphasize the importance of clearly stating how one defines enhancement, not assuming that technology enhances learning, and carefully choosing relevant methods of measuring. In this work, we consider what Kirkwood and Price (2014) denoted as supplementing existing teaching. The most common (and perhaps most objective) way of measuring learning outcomes when adopting this approach is to assess quantitative improvements, and this is what we will consider as enhancement in this work. The improvement can be measured using pre-test/post-test results, i.e. having the participants take a test before the experiment, then retake a similar test after the experiment and compare the results. The outcome will also be compared with the results of a group that takes the same tests but does not use the AR application in between. Note that, although the quantitative improvement is an objective measurement, it lacks the possibility of portraying the quality in learning enhancement (Kirkwood and Price 2014). Thus, we complement the testing with a qualitative form investigating the experienced improvement or stagnation in understanding of the studied phenomenon. The evaluation method in this study therefore followed a convergent parallel design (Creswell and Clark 2017), gathering both quantitative and qualitative data simultaneously.
The most important aspect to evaluate was whether the developed application could enhance the learning of torque among mechanical engineering students. To this end, after an initial lecture on the subject, a pre-test consisting of one question regarding torque was carried out. The students were then invited to volunteer for the second user study: a lab session using the AR application to answer questions regarding torque. The session ended with the students filling out a qualitative form. Finally, a post-test consisting of multiple questions was held, two of which examined the concept of torque (note that one of them was similar to the one in the pre-test). The results of the pre- and post-test constitute the quantitative data of this study, complemented with qualitative data from the previously mentioned form. The evaluation approach is described in more detail in Sect. 3.3.1.
Test Design
The students in the Mechanics I course participated in a lecture on torque. A couple of weeks after the lecture, the students were given the opportunity to solve a problem on torque, without any particular preparation, and this problem was graded. This task is labelled as pre-test in the Supplementary Material 2. After around one week, the students were invited to participate in a workshop to practice the interpretation and calculation of the torque by means of the AR app. The activities in this workshop are labelled as AR Session in the Supplementary Material 2. Around four weeks after this, the students took a partial exam in the Mechanics I course, where two problems were related to torque calculation. These two tasks are labelled as post-test in the Supplementary Material 2. Note that not all the students who took the post-test were part of the pre-test or the AR session, and a schematic representation of this is provided in Fig. 6. A total of 26 students volunteered to participate in the workshop, which was designed as a user study with the developed AR prototype. It is important to note that when experimenting with early prototypes of technological interventions in a learning context, it is beneficial to run pilot studies of modest sample sizes to focus first on the usability of the technology before delving deeper into the impact the tools may have on the learning outcomes. These studies are typically short, from a few hours to a few weeks, and involve few participants, from 5 to 30. Studies aimed at learning outcomes will typically recruit at least tens of participants and last at least several weeks if not months or even years. The present study is a pilot to determine the usability of the technological intervention and a pre-investigation to forecast possible impacts. 
The course instructor acquired consent from the program director to run this study and announced in class to the students: “My collaborators and I would like to recruit your participation in a research study. The aim of the study is to test the learning potential of an augmented-reality smartphone application aimed at understanding mechanical torque. Your participation will be anonymous and voluntary. You may cease to participate even after you volunteer without any penalty to you. Your performance in the study will not be a measure of your ability to learn but of the application to support learning. Your performance in the study will not negatively impact your performance in this course in any way. On the other hand, while we will not reward or compensate your participation with academic credit or financially, your participation may improve your learning of the core material of this course and this improvement would materialize as a higher final grade. Participation in this study does not pose any physical or personal risk higher than that present in normal office and classroom conditions. You will not interact with hazardous materials or in hazardous conditions. All the material gathered about your participation will be kept anonymous. Please, contact me by email if you wish to participate. Thank you for considering to participate. Do you have any questions?” By contacting the researcher through email, participants stated their consent to participate under the conditions stated above. Of the 26 volunteers, 12 attended the session. The students could work individually, but they were encouraged to work together since it has been shown that cooperation (i.e., in the context of peer instruction) increases engagement (Crouch and Mazur 2001).
Since having the goal of performing a specific task has been proven effective for closing the gap between what students are expected to learn and what they actually learn (Hattie and Timperley 2007), the lab session included a task sheet in which the example illustrations of the tasks served as the image targets needed for the AR application (see the AR Session section of the Supplementary Material 2). The tasks were exploratory, focused on helping the students understand the concept of torque rather than on performing specific calculations. The session ended with the students being asked to fill out a form, which was meant to provide complementary qualitative data regarding experienced improvement in understanding. The form also included a self-evaluation of their general motivation in the course, in order to gauge the ambition level of the students who participated. As stated above, a few weeks later a post-test consisting of multiple questions was carried out in the Mechanics I course. Two of the questions were about torque, and one was very similar to the question in the pre-test. The results could then be compared with those of the pre-test, and further compared with the results of the group who did not use the application. These results constitute the quantitative assessment of the improvement in understanding of torque. Note that the pre- and the post-tests, together with the tasks and the questionnaire in the AR session, can be found in the Supplementary Material 2.
Statistical Analysis and Scoring System
We estimated that the study would need to include about 500 people to statistically determine the smallest difference of proportion observed with a power of 80% and a significance level of 5% (see Sect. 4.3 for the calculations), while the number of available students was only 39. It is important to note that this study does not aim at reproducibility of the learning outcomes, since it lacks the statistical power for that. Rather, it is a demonstration of the feasibility of the technology, focusing on usability and utility. We address the questions: “Can engineering students use it?”, and “Do they get and perceive value by using it?”. To answer the question “Can people learn more by using the tool?” we would need more participants and a longer study. As a starting point of such a quantitative study, statistical analyses were performed to properly evaluate any performance improvements. The first analysis was a statistical hypothesis test of proportions within the same population at different stages, which examines whether possible differences between proportions are statistically significant. The procedure starts with declaring a null hypothesis (\(H_{0}\)), which states that there is no difference between the two proportions being compared (\(p_{1}\) and \(p_{2}\)). Secondly, an alternative hypothesis (\(H_{A}\)) is defined, stating that there is a difference between the proportions. That difference is defined according to whether the test is one- or two-sided. If the test is two-sided, \(H_{A}\) will be the opposite of \(H_{0}\), i.e. the proportions are not equal. On the other hand, if the test is one-sided it will be more powerful, but only one of the options \(p_{1} > p_{2}\) or \(p_{1} < p_{2}\) can be investigated. This is formulated as follows:
$$\begin{aligned} H_{0}: p_{1} = p_{2}; \quad H_{A}: p_{1} \ne p_{2} \text {,} \quad p_{1} > p_{2} \quad \text {or} \quad p_{1} < p_{2}. \end{aligned}$$
(2)
The considered proportions were the proportion of students who passed the pre-test (\(p_{1}\)) and the proportion of students who passed the tasks on torque in the post-test (\(p_{2}\)) in each of the groups, i.e. the one that used the app and the one that did not. Due to the small sample size in this study, a one-sided approach was chosen and only the scenario \(p_{1} < p_{2}\) was analysed, to maximize the possibility of detecting any improvement. This was considered a relevant approach since the raw data showed positive rather than negative tendencies of change, i.e. the students had better results on the post-test than on the pre-test. Furthermore, this decision was motivated by the results seen in the form, where a large majority answered that the app contributed to their understanding of torque. Hence, there was no reason to believe that the app would have caused negative effects on the students’ test results. The statistical analysis was performed using R, a software environment for statistical computing, through the graphical user interface (GUI) R Commander. We ran the test prop.test with a confidence level of 95%.
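As an illustration of the one-sided test of \(H_{0}: p_{1} = p_{2}\) against \(H_{A}: p_{1} < p_{2}\), the following Python sketch implements a pooled two-proportion z-test using only the standard library. It is a simplified stand-in and not the analysis run in this study: it treats the two samples as independent and omits the continuity correction that R's prop.test applies by default, so its p-values will generally differ somewhat from those of prop.test. The function name and the pass counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_z_test(passed_1, n_1, passed_2, n_2):
    """Test H0: p1 = p2 against HA: p1 < p2; returns (z, p_value).

    Simplified sketch: pooled standard error, no continuity correction.
    """
    p_1 = passed_1 / n_1
    p_2 = passed_2 / n_2
    pooled = (passed_1 + passed_2) / (n_1 + n_2)
    if pooled in (0.0, 1.0):
        raise ValueError("test undefined when all students pass or all fail")
    std_err = sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    z = (p_1 - p_2) / std_err
    # Lower-tail probability: small values support p1 < p2.
    p_value = NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts (not the study's data): 4 of 12 pass the pre-test,
# 9 of 12 pass the post-test.
z_stat, p_val = one_sided_two_proportion_z_test(4, 12, 9, 12)
print(f"z = {z_stat:.3f}, one-sided p = {p_val:.4f}")
```

With a one-sided alternative, only the lower tail of the distribution counts as evidence, which is why the test is more powerful in the direction \(p_{1} < p_{2}\) but cannot detect a deterioration.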
To compare the results between the groups, a grading system was developed so that the average results of both groups could be analyzed in a reasonable way, given the limitations imposed by the small sample size and by the pass-or-fail grading. Note that this grading system differs from the one adopted in the actual exam, because the present system aims at assessing the improvement with respect to the pre-test. The system was designed to favor improvement in the test results, especially regarding task 2 in the post-test, since it was the most similar to the task in the pre-test. It is important to note that the tasks are labelled in the Supplementary Material 2, and the notations task 2 and task 3 refer to the order of the tasks in the exam, which included problems on other topics. The scores for task 2 therefore depend on the result from the pre-test, while task 3 is scored individually. Hence, a student who did not pass the pre-test but passed both tasks on the post-test would receive the highest score, while a student who passed the pre-test and failed both tasks on the post-test would receive the lowest score. The points were assigned as shown in Fig. 3. The average score from each group was calculated and Welch’s t-test was performed to compare those averages. Welch’s t-test evaluates the hypothesis that the averages of two groups, which may differ in size and variance, are equal. Thus, the hypotheses are formulated as follows:
$$\begin{aligned} H_{0}: \mu _{1} = \mu _{2}; \quad H_{A}: \mu _{1} \ne \mu _{2}, \end{aligned}$$
(3)
where \(\mu _{1}\) and \(\mu _{2}\) are the averages in each group. This analysis was also performed using R, through t.test with a confidence level of \(95\%\). Engineering students at a major technical university like KTH are not a representative sample of the general population. They have self-selected from the top 25% of their graduating cohorts from high school, and their grade performance is skewed toward the top of the scales. Once the scales are re-normalized for higher-level education, the curve typically remains skewed towards the upper levels of achievement in grading criteria. Yet, according to Borg et al. (1989), in terms of learning outcomes and skills acquired, the underlying distribution remains normal, thus justifying the use of Welch’s t-test in the present context.
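The quantities behind Welch's t-test can be sketched in a few lines of Python using only the standard library. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom; it does not compute the p-value, which requires the Student t distribution (available, for instance, through R's t.test or through scipy.stats.ttest_ind with equal_var=False). The function name and the scores are hypothetical and do not correspond to the scoring in Fig. 3 or to the study's data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_1, sample_2):
    """Return (t, df) for Welch's unequal-variance t-test.

    t compares the two sample means; df is the Welch-Satterthwaite
    approximation of the degrees of freedom.
    """
    n_1, n_2 = len(sample_1), len(sample_2)
    # Squared standard errors of each mean (sample variance / group size).
    se2_1 = variance(sample_1) / n_1
    se2_2 = variance(sample_2) / n_2
    t = (mean(sample_1) - mean(sample_2)) / sqrt(se2_1 + se2_2)
    df = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n_1 - 1) + se2_2 ** 2 / (n_2 - 1))
    return t, df


# Hypothetical improvement scores for two groups of different sizes.
app_group = [3, 2, 3, 1, 2, 3]
control_group = [1, 2, 0, 1, 2, 1, 0, 2]

t_stat, dof = welch_t(app_group, control_group)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike the classical two-sample t-test, no pooled variance is used, so the degrees of freedom are generally non-integer and lie between the smaller of the two group sizes minus one and the total sample size minus two.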