One of the main processes for evaluating an artifact (e.g., toward its intended use) and/or testing hypotheses about an artifact (e.g., a UI, prototype, or interaction technique) is experiment design. Experimental design usually evaluates a particular system by means of statistical approaches. Detailed descriptions of experimental designs can be found in research textbooks and technical reports (e.g., Campbell & Stanley, 2015; McKenney & Reeves, 2018; Lazar et al., 2017; Cohen et al., 2002; MacKenzie, 2012). For the purposes of this book, we discuss four common experimental designs that CCI and learning technology researchers are likely to employ for their studies. These designs are considered “backbone” designs, in the sense that they provide the core components that can be used to construct more complex designs. Therefore, understanding these four designs and their components will allow any CCI and learning technology researcher to also understand more complex designs, as well as to adapt and expand them to accommodate their needs.

Before moving on to the four designs, we would like to explain two core notions that will allow you to better understand them. Those two notions (or designs) are “between-subjects” (also known as “between-groups”) and “within-subjects” (also “within-groups”). The notion of between-subjects is very common, and because of its roots in clinical trials, it is considered to be the gold standard of experimental research, especially when combined with random assignment to groups. The main idea is that each subject (e.g., a learner or a child) is exposed to only one condition, either the control condition or an experimental condition. Afterwards, statistical analysis investigates the difference in the variable of interest between the control group and the experimental group(s). The notion of within-subjects entails that each subject is assigned to all the treatments, in a single or repeated manner and in a specified or unspecified order, depending on the needs and goals of the experiment. The main idea is that the same subject is exposed to all the treatments, which allows them to serve as “their own control group.” Researchers can also combine these two designs in a mixed research design, which is, in effect, a within-subjects design inside a between-subjects design. This enables multiple comparisons but also increases the logistics and complexity of the study (which becomes, in effect, two studies). Such combinations and extensions of the basic research designs are not necessary for understanding the core designs, and are therefore beyond the scope of this chapter. Figure 5.1 shows how a simple experimental design with 12 participants and control and experimental conditions would look in the case of a between-subjects design, a within-subjects design, and a mixed design.

Fig. 5.1 A between-subjects design, a within-subjects design, and a mixed research design using the same sample of 12 participants (P1–P12)
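To make the three allocation schemes concrete, here is a minimal Python sketch (our own illustration, not taken from this chapter) of how the same 12 hypothetical participants could be mapped to conditions under each design.

```python
# A minimal sketch of the three designs in Fig. 5.1, using 12 hypothetical participants.
participants = [f"P{i}" for i in range(1, 13)]

# Between-subjects: each participant sees exactly one condition.
between = {
    "control": participants[:6],        # P1-P6
    "experimental": participants[6:],   # P7-P12
}

# Within-subjects: every participant sees both conditions.
within = {p: ["control", "experimental"] for p in participants}

# Mixed: a between-subjects split, with a within-subjects factor inside each group.
mixed = {
    "group_A": {p: ["control", "experimental"] for p in participants[:6]},
    "group_B": {p: ["experimental", "control"] for p in participants[6:]},
}

print(between["control"])
print(within["P1"])
print(list(mixed["group_A"].items())[:3])
```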

Those two notions are very powerful in CCI and learning technology research, and knowing their pros and cons allows researchers to make good choices. It is also important to highlight that there is no right or wrong research design. Instead, researchers should consider their needs (including contextual and disciplinary requirements) and make the most appropriate choice. Table 5.1 summarizes some common decision factors to bear in mind when considering the use of between-subjects and within-subjects designs in CCI and learning technology research.

Table 5.1 Decision factors for choosing a between-subjects design or a within-subjects design.

Now that we have explained those two core notions and their inherent qualities, we can distinguish four experimental designs that are commonly used in CCI and learning technology. These are also “backbone” designs in the sense that they can be used to construct more advanced designs. First, we consider randomized experiments (also known as “true experiments”), which follow the between-subjects principles and use random assignment to create the control and experimental groups. Next, we consider quasi-experiments, which are mainly between-subjects (although you might see within-subjects experiments known as “repeated measures quasi-experiments”), with nonrandom assignment of subjects. The next design, repeated measures, is a within-subjects design in which all the subjects are exposed to all the conditions. Last, we consider the time series design, which is a quasi-experiment that employs repeated measurements, with the experimental condition(s) induced between the measurement periods. We will consider each of these designs in detail, but keep in mind that they constitute a basic set of research designs that are common in the fields of CCI and learning technology, and that they can be enhanced with “advanced qualities” such as counterbalancing, placebos, confederates, and deception, as required.

5.1 Randomized (True) Experiments

Randomized experiments are the ideal choice for maximizing the internal validity of a study. Their unique characteristic is that the subjects are assigned at random to a condition (the control group or the experimental group), which ensures that there are no systematic differences between the groups (Shadish et al., 2002). The random assignment eliminates systematic error and ensures that the control and experimental groups are subjected to identical environmental conditions while being assigned to different conditions. This can be achieved by means of any random selection mechanism (e.g., a random numbers table, a random number generator app, or even tossing a fair coin).
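As a simple illustration of such a mechanism, the following Python sketch (a hypothetical example of ours, not a procedure prescribed in this chapter) shuffles a list of anonymized participant IDs and splits it in half to form the control and experimental groups.

```python
import random

# Hypothetical participant identifiers; any anonymized IDs would do.
participants = [f"P{i}" for i in range(1, 21)]

random.seed(42)            # fixed seed only so the example is reproducible
random.shuffle(participants)

half = len(participants) // 2
control_group = participants[:half]        # no dashboard
experimental_group = participants[half:]   # dashboard-enhanced LMS

print("Control:", control_group)
print("Experimental:", experimental_group)
```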

A very simple example at the confluence of CCI and learning technology is a randomized experiment on the use of learning dashboards (i.e., graphical user interfaces that visualize students’ activity) to support secondary school students. The aim is to identify any potential effect of the use of a dashboard (the independent/manipulated variable) on students’ learning performance, such as their scores in weekly tests (the dependent/outcome variable). The students are assigned at random to either the control group (no use of dashboard) or the experimental group (use of dashboard). The experimental group is then exposed to the treatment (using the dashboard) for a period of time (e.g., 2 weeks). At the end of the period, we compare the learning performance scores of the two groups (Fig. 5.2).

Fig. 5.2 Example of a randomized experiment (learning performance scores of randomly assigned groups, n = 10 per condition: no-dashboard LMS vs. dashboard-enhanced LMS)
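To sketch how the final comparison might be made, here is a minimal Python example with invented weekly test scores (the scores, group sizes, and choice of Welch’s t-test are our own assumptions; the chapter does not prescribe a particular test).

```python
from scipy import stats

# Hypothetical weekly test scores after the 2-week period (n = 10 per group).
control_scores      = [62, 58, 71, 65, 60, 68, 63, 59, 66, 61]   # no dashboard
experimental_scores = [70, 66, 77, 72, 68, 75, 71, 67, 74, 69]   # dashboard-enhanced LMS

# Welch's t-test: compares two independent groups without assuming equal variances.
t, p = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```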

5.2 Quasi-Experiments

Quasi means “resembling,” and quasi-experiments resemble experimental settings as far as possible without assigning subjects to conditions at random. Quasi-experiments allow the researcher to assign subjects to a control or experimental condition according to some criterion (e.g., class, pre-test, or previous grades), depending on contextual factors, the ultimate goal, and any particular needs of the population in focus. In CCI and learning technology research, random assignment may be neither feasible nor practical, and in some cases it may not be ethical. A good example is research that occurs in school settings, where it is almost impossible to form random groups within a class environment and expose them to different conditions (and even if it is possible, it will result in very low ecological validity). Although such contextual factors preclude the use of a randomized experiment, they lend themselves to the use of a quasi-experiment. For instance, the researcher can expose two similar classes to the control and experimental conditions to identify the effect of the treatment (e.g., use of a technology) on the dependent variable (e.g., learning performance).

In a quasi-experiment, biases can easily be introduced. For example, schools may be included that have students with different socioeconomic statuses and different degrees of parental support; within a school, classrooms with different teachers or different curricula can be included. Accordingly, because of the lack of randomization, quasi-experiments face certain internal validity threats, and in many cases researchers use background information (e.g., students’ grades or previous performance) or pretests (or even pre- and post-tests) to strengthen internal validity. These additional processes are used to establish group equivalence and to remedy the lack of the equivalence that true experiments obtain through randomization.

In terms of the previous example, a quasi-experiment will assign class A as the control group (no use of dashboard) and class B as the experimental group (use of dashboard). The researcher can also compare the average grades of the two classes, or even conduct a pre-test, to make sure that there is good group equivalence (e.g., on GPA). The experimental group is then exposed to the treatment (using the dashboard) for some time (e.g., 2 weeks). At the end of that period, the researcher can compare the learning performance scores of the two groups (Fig. 5.3).

Fig. 5.3 Example of a quasi-experiment (learning performance scores of class A and class B, n = 10 per class: no-dashboard LMS vs. dashboard-enhanced LMS)
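The sketch below illustrates the analysis logic described above with invented scores for the two intact classes: first check pre-test equivalence, then compare the post-test scores. This is only one reasonable way to operationalize the check; gain scores or an ANCOVA would be common alternatives.

```python
from scipy import stats

# Hypothetical pre-test (equivalence check) and post-test scores for two intact classes.
class_a_pre  = [62, 58, 71, 65, 60, 68, 63, 59, 66, 61]   # class A: no dashboard
class_b_pre  = [64, 57, 69, 66, 62, 67, 61, 60, 65, 63]   # class B: dashboard
class_a_post = [66, 61, 73, 68, 63, 70, 65, 62, 69, 64]
class_b_post = [72, 66, 77, 74, 69, 75, 70, 68, 73, 71]

# Step 1: check group equivalence on the pre-test (ideally no significant difference here).
t_pre, p_pre = stats.ttest_ind(class_a_pre, class_b_pre, equal_var=False)

# Step 2: if the classes look comparable, compare the post-test scores.
t_post, p_post = stats.ttest_ind(class_a_post, class_b_post, equal_var=False)

print(f"Pre-test:  t = {t_pre:.2f}, p = {p_pre:.3f}")
print(f"Post-test: t = {t_post:.2f}, p = {p_post:.3f}")
```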

5.3 Repeated Measures Experiments

A repeated measures design is a within-subjects design in which all the participants are exposed to all the conditions. In practice, this means that each participant serves as their own control after being exposed to the treatment. In some cases, using the same sequence (e.g., the control condition first, then the experimental conditions) will work. Usually, however, a stronger design will randomize or counterbalance the order so as to eliminate any potential ordering effect. This matters, for example, when the participants are exposed to the same learning materials or are likely to get tired, as familiarity or fatigue might affect subsequent learning performance. Randomizing the order can help us remove any potential order bias, and counterbalancing can be used to deal with potential order effects while reducing potential carry-over effects. To achieve complete counterbalancing, we need to make sure that the participants are balanced across all the possible condition orders. With two conditions (control and experimental), this is a simple matter. However, as the number of conditions increases, counterbalancing becomes more complex (see Fig. 5.4), with the number of possible orderings growing factorially as C! (where C is the number of conditions).

Fig. 5.4 Counterbalancing with two and three conditions (A, B, and C are the conditions): a 2 × 2 table of orders for groups G1 and G2, and a 3 × 3 table for groups G1, G2, and G3, with each condition appearing exactly once in each row
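The following Python sketch (our own illustration, with hypothetical participant IDs) enumerates all C! orders needed for complete counterbalancing of three conditions and assigns participants to those orders in rotation. A Latin square, as shown in Fig. 5.4, would instead use only C orders, trading completeness for a smaller sample requirement.

```python
from itertools import permutations

conditions = ["A", "B", "C"]

# Complete counterbalancing: every possible order of the conditions (C! orders).
orders = list(permutations(conditions))
print(len(orders), "orders:", orders)   # 6 orders for 3 conditions

# Assign hypothetical participants to orders in rotation, so each order is used equally often.
participants = [f"P{i}" for i in range(1, 13)]
assignment = {p: orders[i % len(orders)] for i, p in enumerate(participants)}
print(assignment["P1"], assignment["P7"])
```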

Returning to our previous example, a repeated measures study will expose all the participants to both conditions, probably in a randomized or counterbalanced order. For example, half of the participants will be exposed to the control condition first (no use of dashboard) and the other half to the experimental condition first (use of dashboard), with each half then completing the remaining condition. At the end of the set period, the learning performance scores under the two conditions will be compared (Fig. 5.5).

Fig. 5.5 Example of a repeated measures experiment (learning performance scores of all students, n = 20, under both the no-dashboard LMS and dashboard-enhanced LMS conditions)
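Because every participant contributes a score under both conditions, a paired test is a natural analysis choice. The sketch below uses invented scores (not data from this book) to illustrate this with a paired t-test.

```python
from scipy import stats

# Hypothetical weekly test scores for the same 20 students under both conditions,
# collected in a counterbalanced order as described above.
no_dashboard = [61, 64, 58, 70, 66, 59, 63, 68, 62, 65,
                60, 67, 69, 57, 71, 64, 62, 66, 63, 68]
dashboard    = [66, 69, 61, 74, 71, 63, 67, 73, 66, 70,
                64, 72, 75, 60, 76, 69, 66, 71, 68, 73]

# Each participant serves as their own control, so a paired (dependent-samples) test is used.
t, p = stats.ttest_rel(no_dashboard, dashboard)
print(f"Paired t-test: t = {t:.2f}, p = {p:.3f}")
```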

5.4 Time Series Experiments

The last type of experimental design we will consider in this chapter is the time series design, which involves repeated measurement of a group, with the treatment (or treatments) induced between measurements. With the democratization of learning analytics and other user analytics (e.g., data collected from user clickstreams, keystrokes, or sensor data), this type of design is becoming increasingly popular in contemporary learning technology and CCI research. The absence of randomization and of distinct experimental and control groups (as in true experiments) entails all the difficulties associated with a quasi-experiment, not least the impossibility of attributing changes in the dependent variable directly to the treatment (since an improvement or deterioration in the performance of the group participating in the time series may be due to other factors, known as confounds). The uniqueness of this research design lies in the use of continuous measurements. In some cases (such as high-frequency learning analytics), measuring at high frequency reduces the opportunity for confounding variables to be introduced, many of which creep in over time in learning technology and CCI contexts (e.g., familiarity with the task/content). Time series designs can take various forms, some of which provide better internal validity than others (e.g., repeatedly adding and removing treatments, or switching replications; for more details, see Shadish et al., 2002). However, in order to conceptualize and understand this design, we can think of a single-group time series with a group (G) and measurements (M) that take place several times prior to and after receiving the treatment (T).

$$ \mathrm{G} \quad \mathrm{M}_1 \quad \mathrm{M}_2 \quad \mathrm{T}_1 \quad \mathrm{M}_3 \quad \mathrm{M}_4 \quad \mathrm{T}_2 \quad \mathrm{M}_5 \quad \mathrm{M}_6 \quad \dots $$

To better grasp this design, let us imagine that we have no dashboard as the control condition, dashboard type 1 as experimental condition 1 (T1), and dashboard type 2 as experimental condition 2 (T2). In the first time segment (e.g., a class hour or day), the group experiences the control condition, and the respective measurements (e.g., surveys or analytics) are taken. In the second time segment, the group experiences dashboard type 1, and again the respective measurements are taken. In the third time segment, the group experiences dashboard type 2, and again the respective measurements are taken. At the end of the period, as with the previous experimental designs, the measurements taken (e.g., learning performance scores) under the control and experimental conditions are compared (Fig. 5.6). The time segments can be randomized and/or repeated, and other techniques can be applied to increase the internal validity.

Fig. 5.6 Example of a time series experiment (measurements for all students, n = 20, under the no-dashboard LMS, dashboard-1-enhanced LMS, and dashboard-2-enhanced LMS conditions)
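To make the single-group schedule concrete, the following sketch (with invented measurements) groups hypothetical scores by segment of the G M1 M2 T1 M3 M4 T2 M5 M6 schedule and reports a simple per-segment mean; in practice, a more formal time series analysis would usually follow.

```python
import statistics

# Hypothetical measurements (e.g., weekly scores) for one group, following the
# G  M1 M2  T1  M3 M4  T2  M5 M6  schedule sketched above.
segments = {
    "no dashboard (M1, M2)": [61, 63],
    "dashboard 1 (M3, M4)":  [66, 68],
    "dashboard 2 (M5, M6)":  [67, 72],
}

for label, scores in segments.items():
    print(f"{label}: mean = {statistics.mean(scores):.1f}")
```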

A time series design is appropriate for longitudinal research designs and high-frequency data collections that involve a group or groups that are measured repeatedly, usually at regular intervals. It is important to remember that although the time series design is a special type of quasi-experiment that takes advantage of the qualities of time (e.g., confounds that are introduced over time, and altering/repeating conditions over time), it is also vulnerable to the weaknesses of quasi-experiments. Therefore, we should interpret the results with caution. Time series designs may also have specific weaknesses that must be addressed when analyzing the data. For example, they sometimes produce data points that are autocorrelated (for example, in very high-frequency data collection) and are therefore unsuitable for certain statistical analyses (e.g., those that require independent data points; detailed information can be found in Kennedy, 1998). Conceptually, the primary concern is whether there is an exogenous influence (a confounding variable) that takes place at the same time as any of the interventions (e.g., new or significantly different content, alterations to the instructions, or a bug in the system), or whether there are significant changes in the sample or an environmental condition (e.g., dropouts, fatigue, or changing classrooms).
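As an example of the autocorrelation concern, the following sketch (hypothetical measurements, computed with plain NumPy) estimates the lag-1 autocorrelation of a series; a value well above zero suggests that the observations are not independent and that analyses assuming independence should be avoided.

```python
import numpy as np

# Hypothetical high-frequency measurements for one group over time.
series = np.array([61, 62, 64, 63, 66, 67, 69, 68, 70, 72, 71, 73], dtype=float)

# Lag-1 autocorrelation: correlate the series with itself shifted by one step.
lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]
print(f"Lag-1 autocorrelation: {lag1:.2f}")
```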

In the past, the time series design was mainly employed to detect unstable behavior patterns, and we see relatively few studies using this type of design (Ross & Morrison, 2013). Another reason for its limited use is the significant demands of longitudinal studies and the prolonged involvement of human subjects. Moreover, we often find time series designs included under the umbrella term of “quasi-experimental design” (Campbell & Stanley, 2015).