Introduction

Many classrooms have employed blended learning models that integrate live interactions and self-paced learning with educational software that professes to tailor instruction to individual needs (Heinrich et al., 2020). In particular, intelligent tutoring systems (ITSs) have been commonly used as supplements to address student learning gaps and are often advertised as improving student learning outcomes. While some research studies have shown promise in this regard (Koedinger et al., 2010; Pane et al., 2014), historically, the evidence on the effectiveness of ITSs has been mixed, and their effectiveness remains in question (Fang et al., 2019; Kulik & Fletcher, 2016; Reich, 2020; Sun et al., 2021), as ITSs and blended learning technologies are not always easy to implement in classroom environments (Amro & Borup, 2019; Brasiel et al., 2016; Karam et al., 2017; Phillips et al., 2020; Ritter et al., 2016).

In their meta-analysis on the effectiveness of ITSs, Kulik and Fletcher (2016) found that teacher experience with the tutoring program and fidelity to the intended use of the program are critical for realizing meaningful improvements in learning outcomes. Relatedly, Phillips et al. (2020) found that teachers face many challenges in integrating the adaptive tutoring system ALEKS for high school algebra with their pedagogical practice. The study reported that only four out of 40 teachers regularly considered ALEKS data to help inform lesson planning. Despite low usage among teachers and students, by and large, teachers in the Phillips et al. (2020) study reported that ALEKS met a critical need because it delivered remedial instruction in foundational skills, which was missing from the standard algebra curriculum. In light of the incongruence between teacher buy-in and teacher usage, we theorize that the strengths of the program may not easily accommodate common pedagogical practices in the classroom such as teacher-led differentiated instruction or collaborative learning. This is not to say, however, that the strengths of the program cannot be leveraged for these purposes. Indeed, we contend that the ALEKS system offers a novel opportunity for supporting teacher-led differentiated instruction and eliciting collaborative learning in a purposeful way.

The overarching question we seek to investigate is this: How can we better align the personalized instruction provided by ALEKS with teachers’ pedagogical practices? Our proposed solution is to use ALEKS learning data to form homogenous or heterogeneous groups of students and to provide teachers with content recommendations that each group could focus on. We hypothesize that such groupings and recommendations could in turn help teachers implement differentiated instruction using small groups, although we do not test this hypothesis in this paper. Specifically, we develop three algorithms that form groups using data from the ALEKS adaptive assessment and the system’s mastery-based learning mode, which provide information on student knowledge and readiness on hundreds of mathematics skills. Importantly, our methods not only identify groups of students but also identify the ALEKS content that each group should focus on. More broadly, we hope to show how viewing student ability as a multidimensional construct could potentially allow teachers to implement differentiated instruction in more powerful ways than by simply viewing student ability as a fixed unidimensional construct, as is the case in the common pedagogical practice of ability grouping (Jensen & Lawson, 2011; Leonard, 2001; Linchevski & Kutscher, 1998; Webb & Kenderski, 1984; Wesson, 1992) as well as existing group formation algorithms (Abnar et al., 2012; Christodoulopoulos & Papanikolaou, 2007; Graf & Bekele, 2006; Liang et al., 2021; Manske et al., 2015). We do this by highlighting two central takeaways that may be counterintuitive or may go unrecognized under a unidimensional construct of ability. The first is that groups of students who may be traditionally perceived as having different ability levels may show commensurate preparedness on a specific area of the curriculum. The second is that students who show an overall lower ability in comparison to their peers may possess specific strengths that make them relative “experts” in some area of the curriculum.

We present three efficient and easy-to-use algorithms that leverage ALEKS data to suggest within-classroom groupings of students. Each method addresses a specific use case, yet each produces groupings that could be used in a variety of classroom settings. That is, any one of our methods may be used for a variety of pedagogical practices where segmenting the class into small groups may be useful. Such practices include, but are not limited to, collaborative problem solving, inquiry learning, and conceptual teacher-led lessons. Moreover, any one of these practices may or may not use ALEKS as the source of instruction or learning. For example, a teacher may want students to engage in group problem solving while working within ALEKS, or a teacher may wish to group students based on their ALEKS data for the purpose of working on group activities outside of ALEKS. Anecdotally, we have spoken to educators who are interested in leveraging ALEKS data for the purpose of delivering conceptual lessons to small groups of students. Regardless of the specific practice, the unifying component of our methods is that they leverage the ALEKS data to support a variety of practices suitable for small groups. Our three grouping methods are described as follows:

  1. Within-module grouping method: A method that groups students homogenously based on their readiness to learn a pre-determined module in ALEKS that the teacher assigns to the entire class. This method is designed for when the teacher is teaching a sequenced curriculum, but still wants to differentiate instruction based on students’ prior knowledge relative to each module. This method applies a k-means clustering algorithm for which the groups are not necessarily of equal size.

  2. Curriculum-wide grouping method: A method designed for an ALEKS class without pre-determined modules, where students are free to work on different skills across the curriculum at their own pace. This method simultaneously identifies (1) homogenous groups of students in that they are all ready to work on a specific spot in the curriculum, and (2) the spot in the curriculum that each group of students should work on. More so than the previous method, this method is likely useful in classes focused on remediation or self-directed learning. Here, groups are intended to be of equal size of \(n\ge 3\), which is the only parameter needed to form the groups.

  3. Reciprocal pairing method: A method that simultaneously (1) forms student pairs that are matched based on complementary knowledge to support Reciprocal Peer Tutoring (RPT) in which students interchange tutor and tutee roles, and (2) determines the content that each pair should focus on. Complementary knowledge refers to when each student in the pair possesses some knowledge that the other does not. This method is also designed for ALEKS classes without pre-determined modules, as it is best applied in this context since it aims to find individual strengths of each student from among the entire curriculum.
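The definition of complementary knowledge used in the third method can be illustrated with plain set operations. The following is a minimal sketch, assuming each student's knowledge is available as a set of item identifiers; the function names and the example items are hypothetical and are not taken from ALEKS.

```python
def complementary_items(state_a, state_b):
    """Return (items only A knows, items only B knows)."""
    a, b = set(state_a), set(state_b)
    return a - b, b - a


def is_complementary(state_a, state_b):
    """True when each student knows at least one item the other does not."""
    only_a, only_b = complementary_items(state_a, state_b)
    return bool(only_a) and bool(only_b)


# Hypothetical knowledge states for three students.
ana = {"fractions", "ratios"}
ben = {"fractions", "exponents"}
cy = {"fractions", "ratios", "exponents"}

print(is_complementary(ana, ben))  # each can tutor the other on one item
print(is_complementary(ana, cy))   # one-sided: cy knows everything ana knows
```

Under this reading, the two difference sets also suggest candidate content for each direction of the tutoring exchange.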

For each of our proposed methods, knowledge and readiness are determined by the student model in ALEKS. This modeling represents student knowledge and readiness across several hundred skills, which provides the multidimensional notion of ability that we use in this study. This highlights a novel approach to group composition, which is typically determined by ability in the form of a single or small set of assessment scores. Also, because both readiness and knowledge do not remain fixed across several hundred skills, these methods are designed to provide groupings that are data-driven and flexible, allowing for regrouping at any given time. By taking a fine-grained view of ability over hundreds of skills and providing flexibility for regrouping, our methods are equipped for making more nuanced groupings that may be difficult to attain from a single score. In addition, it is worth reiterating that our methods also identify the content that the groups (or pairs) of students should focus on, which is a novel contribution in its own right. The reasoning and mechanisms used to accomplish this in each method are described in Sections Method Specifications, Method Specifications, and Method Specifications, respectively. To evaluate our methods, we compare each method with a grouping algorithm that assigns students to groups randomly, and a method that groups students using a unidimensional measure of student ability.

Our paper presents one possible solution for how to bridge personalized learning technologies and teachers’ pedagogical practices. We view the rich data about student knowledge that personalized learning technologies like ALEKS provide as such a bridge. Our particular solution is to use this data to automatically form small groups of students to support various forms of small group differentiated instruction in the classroom. However, this is just one of many possible ways in which multidimensional data on student learning could be used to support differentiated instruction in the classroom. Ultimately, in order to determine the utility of algorithms such as ours, they must be tested in actual classroom settings. We do not present an evaluation of our algorithms that actually forms groups of students in classroom settings. Therefore, we cannot make any claims as to whether groupings using our algorithms would result in improved outcomes such as higher academic achievement. We only form mock groupings of students using historical data to assess the quality of groups using various metrics as compared to other methods. While evaluation studies are known to be quite popular in the field of group formation algorithms in collaborative learning, there is little known work that takes a technical validation approach, which presents novel computational techniques that have not yet been implemented in practice (Cruz & Isotani, 2014). The current work should be viewed as a technical validation research study that is meant to test the feasibility of such techniques before evaluation can take place.

This paper is organized as follows. In Section Related Works, we will provide a review of selected works on small group formations, both in traditional classroom settings and in computer-supported collaborative learning, as well as how these approaches compare and contrast with the current work. In Section A Brief Overview of ALEKS, we will follow up with a brief overview of the adaptive intelligent tutoring system ALEKS, which is an instantiation of Knowledge Space Theory (KST). We will also provide a basic background of KST for introducing terminology that will help us better understand our grouping methods. In Section Overall Approach, we will begin to introduce the current study with a description of our overall methodological approach for evaluating our grouping methods and the data we use. In each of Sections Within-Module Grouping Method, Curriculum-Wide Grouping Method, and Reciprocal Pairing Method, we will describe one of our methods (including the goal of the method), metrics for evaluation, and results that lead to insights with regard to overall student ability and evidence supporting the conventional wisdom that each student has the potential to be a meaningful contributor in the learning process of their peers. Sections Within-Module Grouping Method, Curriculum-Wide Grouping Method, and Reciprocal Pairing Method are designed to be modular where each section may be reviewed independently without prior familiarity of the other two. This is done so that readers have the option of reviewing whichever method(s) they feel best align with their specific interests or needs. Finally, in Section Discussion, we discuss the overall implications of this work, discuss limitations to our study, and mention potential future directions.

Related Works

Group Formation in the Classroom

A critical consideration when it comes to forming groups in a classroom setting is group composition. Group composition is often determined by student ability, which is typically represented by a single or small set of assessment scores. These scores are often used to place students in either heterogeneous or homogenous groups (Donovan et al., 2018; Harlow et al., 2016; Leonard, 2001; Murphy et al., 2017; Wyman & Watson, 2020). This may be accomplished by ranking students across multiple assessment scores or probes that are then used to form homogenous groups where student scores produce a combination of like measurements and/or heterogeneous groups where student scores reveal a mixture of low, average, and high scores (Donovan et al., 2018; Harlow et al., 2016; Murphy et al., 2017; Wesson, 1992; Wyman & Watson, 2020). Other forms of composition may involve grouping students based on gender, learning perspectives, student attitudes towards group work, personality traits, and student online learning engagement (Donovan et al., 2018; Kanika et al., 2022; Sanz-Martínez et al., 2019). While these approaches may incorporate several assessment scores, probes, and/or other attributes simultaneously, they incorporate a relatively low-dimensional view in comparison to the high-dimensional view of ability we adopt in this paper.

Numerous studies have shown benefits for both heterogeneous and homogenous groupings. Namely, students in heterogeneous groups have benefited from effective learning and increased achievement (Jensen & Lawson, 2011; Leonard, 2001; Linchevski & Kutscher, 1998; Manske et al., 2015), especially among lower performing students who are given the opportunity to learn from higher performing group members (Donovan et al., 2018; Murphy et al., 2017). Students in homogenous groups have been known to experience higher levels of inquiry learning, and task-completion as well as higher levels of peer interaction and collaboration due to more favorable learning experiences with regard to compatibility (Fuchs et al., 1998; Kanika et al., 2022; Sanz-Martínez et al., 2019). Interestingly enough, however, ability grouping has been widely criticized for its lack of student mobility across groups, which in turn often negatively affects low-achieving students (Castle et al., 2005). One potential solution for addressing the negative effects of ability grouping is flexible groupings, which are designed to be data-driven and non-static groupings that allow for the regrouping of students to address a wide range of student needs (Bates, 2013; Castle et al., 2005; Hoffman, 2002). Because ability is often viewed as unidimensional, we theorize that this could contribute to the lack of student mobility, as students are either viewed as low or high ability.

For the grouping methods introduced in this paper, we adopt a fine-grained multidimensional view of ability, which is less about viewing students as either low or high ability, and more about identifying the strengths and weaknesses of students to address specific needs in different parts of the curriculum. Connor et al. (2018) adopted a highly similar view of ability while investigating assessment to individualize early mathematics instruction. The aim of this study was to form groups of children with similar learning needs based on assessment results and then provide mathematics learning opportunities that were consistent with assessment data. For each student, assessment reports were generated in a fine-grained format showing skills mastered and not mastered across 93 math items ranging in difficulty. Teachers then used these results, which were organized in a table (resembling a heat map) to form flexible groupings. Children were grouped based on similar proficiency across the 93 items at different points in time, allowing for group membership to change based on their progress. Although a rich set of fine-grained assessment data was used to make informed groupings of students, perhaps the most significant difference between the grouping approach studied by Connor et al. (2018) and the current study is that we aim to support teachers by automating a process that could otherwise be time-consuming and difficult to do manually for most teachers.

Computer-Supported Group Formation

Many researchers have developed algorithms for automated group formation (Christodoulopoulos & Papanikolaou, 2007; Connor et al., 2007; Lawrence & Spuck, 1979; Liang et al., 2021; Manske et al., 2015; Redmond, 2001). These algorithms typically group students based on attributes such as measurements of ability (e.g., assessment and quiz scores), motivational scores, learning styles, and personality traits (Maqtary et al., 2019). One popular approach to automated group formation involves using machine learning algorithms that compute Euclidean distances in a multi-dimensional vector space between student feature vectors that store data on various attributes as mentioned above. As we will see in Section Method Specifications, our first grouping algorithm, the within-module grouping method, takes a similar approach by utilizing the unsupervised machine learning algorithm, k-means, to cluster students into homogenous groups. While we do not incorporate personality-based attributes or affective attributes such as motivation or self-efficacy, our grouping algorithms are distinct from previous work in that they use a multidimensional notion of ability that looks at students’ knowledge over several hundred skills.
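To make the distance-based clustering idea concrete, the sketch below runs a small k-means (Lloyd's algorithm) over binary student feature vectors using squared Euclidean distance. It is an illustration only and not the within-module grouping method itself: the feature encoding (1 if a student is ready for an item, 0 otherwise), the deterministic farthest-point initialization, and all names are assumptions made for this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, max_iter=100):
    """Cluster vectors into k groups; returns one cluster label per vector."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each student joins the nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical readiness vectors over four items for four students.
students = ["Ana", "Ben", "Cy", "Di"]
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = kmeans(vectors, k=2)
for name, label in zip(students, labels):
    print(name, "-> group", label)
```

As with any k-means variant, the resulting groups need not be of equal size, which matches the description of the within-module method in the Introduction.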

Another technique used in group automation is to rank students based on student attributes so that groups may be evenly formed with respect to certain conditions as much as possible. In a sense, this attempts to optimize the group formations, but because such tasks can often be computationally expensive, these automations may apply a greedy approach where matches are made repeatedly in succession based on some rank order (Liang et al., 2021; Redmond, 2001). As detailed in Section Method Specifications, we use a similar approach when pairing students with complementary knowledge. A greedy approach is used instead of finding an optimal configuration because such an optimal solution for a typical class size of 15–25 is extremely expensive computationally and perhaps impossible. In response to the absence of an optimal solution, in Section Results we highlight how such a greedy approach could be performed many times to result in better solutions.
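The following sketch illustrates the general pattern of a greedy, rank-ordered matching that is repeated several times, keeping the best result. It should not be read as the pairing algorithm specified later in the paper: the pair score (the smaller of the two "knows but partner does not" counts), the random tie-breaking, and the function names are all assumptions chosen for illustration.

```python
import random
from itertools import combinations


def pair_score(a, b):
    """Hypothetical score: size of the smaller one-sided knowledge difference."""
    return min(len(a - b), len(b - a))


def greedy_pairs(states, rng):
    """Repeatedly accept the best-scoring pair whose members are both unpaired."""
    names = list(states)
    rng.shuffle(names)  # a different order breaks score ties differently
    candidates = sorted(
        combinations(names, 2),
        key=lambda p: pair_score(states[p[0]], states[p[1]]),
        reverse=True,
    )
    paired, pairs = set(), []
    for u, v in candidates:
        if u not in paired and v not in paired:
            pairs.append((u, v))
            paired.update((u, v))
    return pairs


def best_of_restarts(states, restarts=50, seed=0):
    """Run the greedy matching several times and keep the best total score."""
    rng = random.Random(seed)
    best, best_total = None, -1
    for _ in range(restarts):
        pairs = greedy_pairs(states, rng)
        total = sum(pair_score(states[u], states[v]) for u, v in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best, best_total


# Hypothetical knowledge states, with items encoded as integers.
states = {"A": {1, 2}, "B": {3, 4}, "C": {1, 2, 3}, "D": {1}}
pairs, total = best_of_restarts(states)
print(pairs, total)
```

Each greedy pass is inexpensive, so repeating it with different tie-breaking orders is a cheap way to search for better configurations without guaranteeing an optimal one.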

Grouping algorithms have also been incorporated into learning software systems to assist students and teachers. Connor et al. (2007) studied the effects of algorithms that group students and make individualized instructional recommendations in the web-based software Assessment to Instruction (A2i). The current study shares a couple of key motivating points with Connor et al. (2007). First, individualizing instruction with the use of A2i did not present a new reading curriculum but rather a new way of implementing the reading programs. Similarly, the grouping algorithms introduced in the current study are not meant to replace the ALEKS curriculum nor any other curriculum blended with ALEKS, but rather they are meant to provide teachers with recommendations for differentiated instruction and collaborative learning. Second, the A2i algorithms used children’s reading scores to help teachers effectively group children of similar reading levels for differentiated instruction. Two of the three methods in the current study also aim to group students of similar ability levels for the purpose of differentiated instruction.

Another system that features a grouping algorithm is TECMap (Technology-Enhanced Concept Mapping), introduced by Dragon and Mitchell (2018). This system leverages teacher-defined concept maps that relate assessments and materials for organizing a course or entire curriculum program. From assessment data, the system estimates a student’s knowledge and represents this on a concept map to recommend resources from the curriculum. Additionally, student knowledge estimates may be used to automate groups of students with similar abilities and identify common concepts for the group to focus on. While we share similar qualities to TECMap, we highlight two significant differences. First, the domain model of TECMap is teacher constructed, while the ALEKS domain model is based on Knowledge Space Theory, which we will introduce in Section A Brief Overview of ALEKS. Second, the concepts that make up a TECMap map are very few (ranging from 10 to 40) in comparison to the hundreds of skills captured in an ALEKS course structure, which underscores the high-dimensional uniqueness of our work. Moreover, Dragon and Mitchell (2018) did not use data to compare properties of their grouping method against any baseline grouping methods.

Peer Tutoring

Constructing small groups based on ability is useful, as they often help teachers deliver differentiated instruction to a small set of students with similar ability. However, a drawback and criticism of ability grouping is that it minimizes diversity in the group and gives less opportunity for students to learn from each other (Wesson, 1992). Therefore, peer tutoring, in which one student takes on the role of the tutor and the other takes the role of the tutee (Heller et al., 2004) may be implemented to address deficiencies in peer interactions and collaboration. Programs adopting such strategies have been shown to be beneficial for both parties involved (Cohen et al., 1982). In their meta-analysis on educational outcomes of tutoring, Cohen and colleagues report that both tutees and tutors experience positive attitudes toward the subject matter covered in the tutoring program. They also found that tutees see larger learning gains over students in control groups who do not receive tutoring.

The potential of peer tutoring has also been recognized within the context of ITSs (Heller et al., 2004; Hoppe, 1995; Wessner & Pfister, 2001). In their proposal for peer tutoring recommendations, Heller et al. (2004) describe an “ask a friend” pairing feature that could be used in the e-learning system RATH. RATH, which stands for Relational Adaptive Tutoring Hypertext, is, like ALEKS, an adaptive intelligent tutoring system based on Knowledge Space Theory (KST). In their collaborative RATH concept (CRATH), the authors propose a collaborative environment in which a student would be able to publish annotations and questions on learning documents encountered by the learner. In the case of a question, the system would select other learners with the appropriate knowledge and ask for their willingness to answer the question. A similar approach was introduced earlier by Hoppe (1995) in work describing system architecture and dialogue design that leverage multiple student modeling for the pairing of students who ask for assistance with peer helper(s). This work employed KST-like concepts where the student model determined the set of skills (from a set of “knowledge elements”) a student knows as well as the set of skills the student does not know. This made it possible for the system to pair a student who lacked knowledge on a skill with a helper who had mastered the skill. The pairing algorithm introduced in the current work takes a similar approach. However, unlike Heller et al. (2004) and Hoppe (1995), we attempt to go a step further by pairing students who could legitimately serve as helpers (or tutors) to each other on different skills.

Cohen et al. (1982) reported that tutors benefit from tutoring programs, as tutors tend to develop a deeper understanding through explaining concepts. Much of the work on the benefits for tutors has been studied through investigating Reciprocal Peer Tutoring (RPT) in the classroom. RPT is a form of collaborative learning that involves students of similar academic background experience interchanging roles of the tutor and tutee (Gazula et al., 2017). By and large, studies report mixed results as it relates to the benefits of RPT when it comes to student motivation, interest, self-efficacy, test anxiety, and academic achievement (Cheng & Ku, 2009; Choudhury, 2002; Fantuzzo et al., 1990; Griffin & Griffin, 1998; Mickelson et al., 2003; Rittschof & Griffin, 2001). RPT studies have generally been conducted outside the context of computer-supported group formations. Typically, students who are assigned to participate in RPT sessions are randomly paired (Fantuzzo et al., 1989a, b; Rittschof & Griffin, 2001). The RPT sessions are usually carried out by having each student develop their own set of 10 to 20 questions for their partner to answer. Upon completion, the questions are graded by the peer tutor, which then initiates an exchange of tutoring on questions answered incorrectly by the partner.

The typical setup of RPT sessions raises some reasonable concerns that may limit the effectiveness of the interventions used in prior studies. First, the concern of whether random pairs of students are appropriate for an RPT exchange is largely avoided in prior work. After all, it is reasonable to suspect some pairs would be ineffective at tutoring their respective partner if their knowledge and/or deficiencies of knowledge are closely aligned with one another. Similarly, if students’ abilities in the pair are drastically disparate, RPT sessions could result in a heavily one-sided exchange where one student is doing most or all of the tutoring. Another concern lies with the actual questions developed by each partner and the self-grading procedure. The issue of whether such questions were substantively and accurately developed and graded should be a concern, as these may have inadvertent consequences for research questions and findings. Lastly, the methodology used to find content on which one student can tutor the other (i.e., developing questions that one’s partner answers incorrectly) could be quite cumbersome, which may affect student motivation and interest. Indeed, Cheng and Ku (2009) reported that some students felt RPT sessions were unnecessary busy work.

There has been research, however, on peer tutoring that avoids some of the aforementioned concerns. The Fuchs Research Group at Vanderbilt University has done extensive work on peer-mediated instructional programs employing Peer-Assisted Learning Strategies (PALS; Fuchs et al., 1995, 1997; Phillips et al., 1994; Stecker et al., 2005). In contrast to random pairs, PALS uses an assessment (Curriculum-Based Measurement) to determine which students need assistance on which topics and then pairs students so that one student (the tutor) has mastered the topic and the other student (the tutee) needs assistance with it (Fuchs et al., 1995). (The exact details of how the pairs were constructed were not described in their studies.) However, the pairs did not practice reciprocal tutoring in this case. In order to ensure all students are sometimes able to play the role of the tutor, pairings were re-assigned every two weeks and any student who had not been a tutor in the previous four weeks would be assigned as a tutor. A later version of PALS did implement reciprocal peer tutoring (Fuchs et al., 1997). Student pairings were formed using a heuristic that paired the highest-performing and lowest-performing students together, the second highest-performing and second lowest-performing students together, and so on. The higher-performing student in each pair was the first tutor, and the other student was the second tutor. Importantly, instead of the tutoring happening on student-developed questions, PALS identifies a skill (already taught by the teacher) that one student in the pair has mastered and the other has not. The methods described above closely mirror the pairing method introduced in the current work. However, our method seeks to pair students with complementary knowledge (i.e., where both students in the pair have learned a concept that the other has not).
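The high-low pairing heuristic described for the later version of PALS can be sketched in a few lines. This is a minimal illustration that assumes a single performance score per student; the function name and the handling of an odd class size (the middle-ranked student is returned unpaired) are assumptions, since those details are not given in the description above.

```python
def high_low_pairs(scores):
    """Pair the i-th highest scorer with the i-th lowest scorer.

    Returns (pairs, leftover), where each pair is (first_tutor, second_tutor)
    and leftover is the middle-ranked student when the class size is odd.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    pairs = []
    i, j = 0, len(ranked) - 1
    while i < j:
        # The higher-performing student is listed first (the first tutor).
        pairs.append((ranked[i], ranked[j]))
        i += 1
        j -= 1
    leftover = ranked[i] if i == j else None
    return pairs, leftover


# Hypothetical scores for a class of five students.
scores = {"A": 90, "B": 80, "C": 70, "D": 60, "E": 50}
print(high_low_pairs(scores))
```

Because the heuristic relies on one score per student, it cannot tell whether the lower-scoring partner knows anything the higher-scoring partner does not, which is the gap that pairing on complementary knowledge is meant to address.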

The reciprocal pairing method introduced in this paper seeks to address some of the concerns outlined above. To our knowledge, this work marks the first study interested in the formation of reciprocal pairs with complementary knowledge. Our approach simultaneously forms pairs of students where each student has learned some content that the other has not, and also identifies the content that each pair should focus on. This allows us to avoid issues surrounding student-developed questions, which may lead to frustration (if inaccurately developed), and also avoid having one student tutoring their peer on a topic that their peer is more knowledgeable about. It is important to note that prior work on RPT and PALS has focused largely on different ways of scaffolding peer tutoring that go beyond how to pair students. Thus, our algorithm could potentially be combined with efficacious pedagogies for scaffolding peer-mediated instruction.

A Brief Overview of ALEKS

Basics of Knowledge Space Theory

ALEKS is an artificially intelligent learning and assessment system that has been used by millions of students for math, chemistry, statistics, and accounting (About ALEKS, 2021). The system’s domain model is based on Knowledge Space Theory (KST), which is an approach to the assessment of knowledge that is based on a combinatoric and probabilistic model introduced by Doignon and Falmagne in 1985 (Cosyn et al., 2021; Doignon & Falmagne, 1985). In KST, a student’s knowledge (or knowledge state) is simply the set of skills known by the student, or the set of items the student can complete correctly unaided. In ALEKS, as well as in KST, an item (Footnote 1) is a problem type that covers a discrete piece of knowledge in an academic course. In KST, a student’s knowledge is represented in a knowledge structure, which is a collection of subsets of the set of all items in an academic course. Any one of these subsets may represent a student’s knowledge state, or simply state, at any given time (Cosyn et al., 2021). Additionally, a knowledge structure captures prerequisite relationships among the items in the course (Desmarais et al., 1995; Doignon, 2014). The notion of prerequisites is an integral component of the ALEKS model and our grouping methods, as we will encounter later. Other fundamental concepts in the ALEKS model are a student’s inner fringe and outer fringe. The inner fringe of a student’s state \(K\) is the set of items \(q\) in \(K\) such that \(K\backslash \left\{q\right\}\) is also a feasible state. Conceptually, the inner fringe can be thought of as the set of items that the student can build directly from. The outer fringe of a student’s state \(K\) is the set of items \(q\) not in \(K\) such that \(K\cup \left\{q\right\}\) is also a feasible state. Conceptually, the outer fringe can be thought of as the set of items the student is ready to learn. The concepts of the inner and outer fringes are somewhat related to Vygotsky’s (1930–1934/1978) notion of Zone of Proximal Development (ZPD). In the case of the outer fringe, the student has acquired enough knowledge to learn new items with the guidance of either the system’s pedagogical model or a more experienced individual such as a teacher or human tutor (Footnote 2).
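The two fringe definitions translate directly into set operations. The sketch below is a toy illustration, assuming the knowledge structure is small enough to enumerate explicitly as a collection of feasible states; the three-item structure (in which item a is a prerequisite of item b and item c has no prerequisites) and the function names are hypothetical.

```python
def inner_fringe(state, structure):
    """Items q in the state such that the state without q is still feasible."""
    return {q for q in state if state - {q} in structure}


def outer_fringe(state, structure, items):
    """Items q outside the state such that the state plus q is still feasible."""
    return {q for q in items - state if state | {q} in structure}


# Toy domain: "a" is a prerequisite of "b"; "c" has no prerequisites.
items = frozenset({"a", "b", "c"})
structure = {
    frozenset(),
    frozenset({"a"}),
    frozenset({"c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"a", "b", "c"}),
}

K = frozenset({"a", "b"})
print(inner_fringe(K, structure))         # removing "a" would leave {"b"}, which is not feasible
print(outer_fringe(K, structure, items))  # the student is ready to learn "c"
```

A course with hundreds of items has far too many feasible states to enumerate in this way, so the sketch conveys only the definitions and not how a production system would represent the structure.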

In later sections, we will frequently revisit the idea of the outer fringe, as it serves as a pillar for each of our grouping methods. Because the ALEKS system is able to keep track of a student’s state and outer fringe over hundreds of discrete items in an academic course, this presents a novel opportunity to investigate group formations based on these KST concepts. More specifically, this presents an opportunity to devise grouping algorithms that take a more nuanced approach by viewing ability as a multidimensional construct over hundreds of skills.

The ALEKS System and Course Set-up

The ALEKS system offers web-based academic courses in mathematics, chemistry, statistics, and accounting for K-12, Higher Education, and independent use. Upon enrolling in an ALEKS course, students take an initial assessment that determines their state in the course (e.g., ALEKS Algebra 1). Unlike traditional assessments that provide a single number score, ALEKS uses an adaptive assessment to measure students’ knowledge by identifying the set of items they know (i.e., state), the set of items they don’t know, and the set of items they are ready to learn (i.e., outer fringe). These sets are determined from the entire set of items in the course, which typically consists of anywhere from 300 to 600 items (Cosyn et al., 2021). After taking the initial assessment, ALEKS guides students through an individualized learning path governed by the knowledge structure. In this experience, students choose an item from their outer fringe and practice several instances of the item. If the student performs well enough, the item is added to the student’s state, upon which the student’s outer fringe is updated. ALEKS will also periodically give students progress assessments and update their states based on their performance. As students progress through the course by adding more and more items to their state, their ALEKS “course pie” is filled (see Fig. 1).

Fig. 1 An example of a student’s ALEKS course pie

On the instructor side, ALEKS provides teachers with a learning management system to support various aspects of class administration and instruction, including the ability to customize course content, build intermediate learning objectives called modules, and monitor overall class and individual student progress. The set of available items in an ALEKS course is organized in a table of contents divided into sections and subsections by content area (see Fig. 2). Each section corresponds to a slice of the ALEKS pie, which roughly covers the amount of material that a typical textbook chapter covers. Each slice is divided into subsections (or subslices), each of which covers roughly the equivalent of a chapter-section in a typical textbook, and each subslice consists of a set of items.

Fig. 2 An example of an ALEKS course table of contents

There are also several ways teachers can set up their courses with respect to how students work through the content. Teachers have the option of setting up modules, which are assigned item sets that may be given due dates or a percentage level of mastery that students must attain before moving on to the next module. When modules are used, students work on items that are both in their outer fringe and in the current module. If no such item exists, this would indicate that the outer fringe is either above or below the level of the module. If students are not ready to learn any item in the module, they must first work on items in their outer fringe that are prerequisites to other items in the module. Teachers may also choose not to use modules in their ALEKS class, a configuration that is more common than the modular one among K-12 classes in ALEKS. If the teacher does not set up the class with modules, students have no restrictions on what items they can work on other than what is determined by their outer fringe. In essence, this configuration allows students to work at their own pace within the system, which is often used for remediation and/or skill acquisition in a blended learning model. Lastly, it is worth noting that when modules are used, the outer fringe of a student may simultaneously contain items in the module as well as prerequisite items for the module.

Overall Approach

In this study, we set out to devise grouping algorithms that teachers may implement with ALEKS. In Section The ALEKS System and Course Set-up, we mentioned that teachers have the option of setting up their ALEKS class with or without modules. So, in devising our grouping algorithms, we wanted methods that would be compatible with both of these possible ALEKS class configurations. We call the first grouping algorithm the within-module grouping method, as it is intended for ALEKS classes in which the teacher has already defined ALEKS modules for the class to complete. We call the second grouping algorithm the curriculum-wide grouping method, as it is intended for ALEKS classes where there are no pre-determined modules, thus allowing students to work on ready-to-learn items from anywhere in the curriculum. This means students are more likely to be working on content that is spread out over the entire curriculum. Given this, the curriculum-wide grouping method places a greater emphasis than the within-module grouping method on identifying appropriate content for groups to focus on. Nonetheless, both methods are designed to form homogeneous groups and are meant to help teachers administer appropriate instruction to groups of students who share a similar level of readiness. We also wanted to devise a grouping method that could potentially support collaborative learning practices with ALEKS. For this, we devised the reciprocal pairing method, which aims to support Reciprocal Peer Tutoring (RPT) to elicit more fruitful collaborative exchanges between students. The goal of this method is to identify pairs for which each student in the pair possesses some specific knowledge that the other does not. More specifically, because we hope to facilitate fruitful dialogue between peers, we aim to identify pockets of content for which the tutor is knowledgeable and the tutee is ready to learn so that the exchange likely occurs in the tutee’s Zone of Proximal Development.
Because this requires identifying a specific pocket of content, this pairing method is most applicable for classes with no modules so that the algorithm is not constrained by any particular module (or set of items).

For each of the three scenarios, we will evaluate our grouping methods against two alternative methods that are adapted to each scenario. The first alternative method is designed to group (or pair) students randomly. The second alternative method is designed to group (or pair) students based on student ability as determined by the student’s current score, which we define as the number of items in the student’s state. Although the student’s current score is updated as the student works in ALEKS, it is worth noting that this measure takes on a unidimensional view of ability at any given time as it is simply a single number score to represent overall course knowledge. We chose these two alternative methods for their plausibility of being implemented in practice by teachers, as we saw in the related works presented in Section Related Works. Moreover, it is important to mention that for the curriculum-wide and pairing scenarios, the groups that are formed randomly and by score are evaluated on content that is identified by the devised algorithms, namely, the curriculum-wide grouping method and the reciprocal pairing method, respectively. This is important because we will evaluate against a set of metrics that rely on the content chosen. Specifically, depending on what content we consider, the evaluation could be drastically different on different sets of items across the curriculum. Therefore, algorithms that simply group students with no regard to the content cannot be fairly evaluated on these metrics unless suitable content is also identified for them.

To evaluate our methods, we use data from three ALEKS Algebra 1 classes (one for each of the three methods). These classes were selected from a random sample of 32 ALEKS Algebra 1 classes between the years 2017 and 2019, whose data was provided to us by ALEKS. We were only provided with ALEKS assessment and interaction data, so we do not have meta-data about the classes or the students taking these classes, but ALEKS Algebra 1 is typically used in high school courses. The process for selecting the three classes we used from the pool of 32 classes is described in Appendix A.

The three subsequent sections are modular by design. Each section may be independently reviewed without any prior familiarity of the other two. In each section, we first describe the algorithm for the method of focus, including the goal and rationale behind its specifications. Second, we describe our evaluation process against each of the two alternative methods, including metrics used in our evaluation and provide context as to why such metrics would be of interest in practice. Third, we summarize our findings and conclude with key insights that emerge from our evaluation process. Before we proceed, it should be mentioned that when we speak of groups of students for the remainder of this paper, we are speaking of mock groupings. That is, when we evaluate groups formed from one method against groups formed from another method, such groups were not actually created with real students. Therefore, this study does not account for the influence of human interaction, which we fully recognize is an important aspect of collaborative learning in small groups. This study does not intend to make any claims on what actually happens as a result of the groupings (e.g. type of interactions, learning, groups staying the same, etc.). This sort of evaluation is beyond the scope of this paper and is left for future work.

Within-Module Grouping Method

Method Specifications

This grouping method is designed for ALEKS classes with modules. The goal of the method is to form groups of students who are at the same level of readiness with respect to a module. For example, consider a module on solving linear equations. A student who only has “one-step equations with whole number coefficients” in their outer fringe probably should not be grouped with a student who only has “multi-step linear equations with fractional coefficients” in their outer fringe, as these two students would not be at the same level of readiness with respect to the module. This presents a highly nuanced situation which we aim to account for in our grouping method.

ALEKS keeps track of a student’s state \(K\) as the student progresses through the course. Specifically, within a given module, ALEKS keeps track of what the student still needs to learn in the module. We apply the k-means algorithm to cluster students based on what students still need to learn in order to complete the module. We denote the set of all items in the course as \(Q\). A student’s state uniquely defines everything that is outside the student’s state, \(Q\backslash K\). This is the set of items in the course that the student still needs to master. For classes with modules and due dates, students must complete all out-of-state items in the module. If any of these are not in the student’s outer fringe, the student must also complete any prerequisite items needed for the module. We denote \(M\) as the set of all items in a module and all of its prerequisites. So, \(M\backslash K\) is the set of all items a student needs to learn to complete the module. We then use this set to build a \(|Q|\)-dimensional student feature vector, where each feature indicates (with a 0 or 1) whether the item with the given index in the course is a member of \(M\backslash K\). We create this feature vector for every student in the class. The idea is to apply the k-means algorithm to these vectors to find clusters of students who have a large overlap in their respective set \(M\backslash K\). However, depending on a variety of factors such as the relative sizes of the module, number of prerequisites and the knowledge structure itself, it is possible for two students to have a relatively large overlap in their respective sets \(M\backslash K\), but have very different outer fringes with respect to the module. For example, imagine a module with five items and two prerequisites that are not a part of the module.
Suppose student A’s outer fringe contains the two prerequisites (and none of the module items) and student B’s outer fringe contains all five module items (and none of the prerequisites). In this case, the sets \(M\backslash K\) for the two students overlap substantially in five items. However, student A is not immediately ready for the module as there is no module item in student A’s outer fringe, while student B is ready for the module as all module items are in student B’s outer fringe. In practice, it would not be good for these students to be in the same homogeneous group since they have no overlap in their outer fringe, which may make it difficult for a teacher to give appropriate instruction to each group member. To help eliminate this kind of scenario, we add a feature that gives the number of module items in the student’s outer fringe. This gives us student vectors with \(\left|Q\right|+1\) features. Lastly, we apply k-means to this set of vectors with \(k=5\) to form our groups. The rationale for our choice of \(k\) is provided in Appendix B.
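As an illustration of the feature construction described above, the sketch below builds the \(\left|Q\right|+1\) features for the two students in the example. The item labels, the extra course item x1, and the function name are hypothetical; the resulting vectors could then be passed to any k-means implementation, which is not shown here.

```python
def feature_vector(course_items, state, module_and_prereqs, module_items, outer_fringe):
    """Build the |Q| + 1 features for one student.

    The first |Q| features flag membership in M \\ K; the last feature is
    the number of module items in the student's outer fringe.
    """
    to_learn = module_and_prereqs - state  # M \ K
    indicators = [1 if q in to_learn else 0 for q in course_items]
    return indicators + [len(outer_fringe & module_items)]


# Hypothetical course: one unrelated item x1, two prerequisites, five module items.
course_items = ["x1", "p1", "p2", "m1", "m2", "m3", "m4", "m5"]
module_items = {"m1", "m2", "m3", "m4", "m5"}
module_and_prereqs = module_items | {"p1", "p2"}  # the set M

# Student A still needs the prerequisites; student B has already mastered them.
vec_a = feature_vector(course_items, {"x1"}, module_and_prereqs,
                       module_items, outer_fringe={"p1", "p2"})
vec_b = feature_vector(course_items, {"x1", "p1", "p2"}, module_and_prereqs,
                       module_items, outer_fringe=module_items)

print(vec_a)  # [0, 1, 1, 1, 1, 1, 1, 1, 0]
print(vec_b)  # [0, 0, 0, 1, 1, 1, 1, 1, 5]
```

The indicator features of the two students agree on the five module items, while the final feature (0 versus 5) separates them, which is the purpose of the added feature.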

Evaluation

We evaluated the within-module grouping method against two alternative methods using metrics that measure the degree of overlap of outer fringes among students in each group. These metrics are described as follows:

  (1) Proportion of pairs that share items: This is the proportion of all possible pairs of students in the group that share at least one item in their outer fringe with respect to the module. Ideally in practice, we want this to be 1, so that teachers can more easily target their instruction. In the case where the proportion is not 1, this means there is a student in the group who may need differentiation apart from the group, which is not ideal.

  (2) Minimum number of items shared: This computes the number of shared items in the outer fringe (with respect to the module) for each possible pair in the group and takes the minimum of all such values. Intuitively, this identifies the “weakest” pair-link in the group. Ideally in practice, we want this value to be as large as possible.

  (3) Average number of items shared: This computes the number of shared items in the outer fringe (with respect to the module) for each possible pair in the group and takes the arithmetic average of all such values. Ideally in practice, we want this as large as possible.

In addition to the metrics above, we also computed the averages of (1) – (3) across all \(k\) groups. While these metrics are reasonably important to consider in practice, we recognize there could be other metrics one might want to consider when evaluating the degree of overlap. We do not claim these are necessarily the best metrics for all use cases, only that they are sensible metrics to consider for the given context.
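The three metrics can be sketched for a single group as follows. This is an illustrative implementation under the assumption that each student’s outer fringe has already been restricted to the module; the function name and the sample group are hypothetical, and the group must contain at least two students.

```python
from itertools import combinations


def overlap_metrics(fringes):
    """Compute metrics (1) to (3) for one group.

    `fringes` holds one set per student: the student's outer-fringe items
    that belong to the module.
    """
    shared = [len(a & b) for a, b in combinations(fringes, 2)]
    return {
        "proportion_sharing": sum(1 for s in shared if s > 0) / len(shared),  # (1)
        "min_shared": min(shared),                                           # (2)
        "avg_shared": sum(shared) / len(shared),                             # (3)
    }


# Hypothetical group of three students.
group = [{"m1", "m2", "m3"}, {"m2", "m3"}, {"m3", "m4"}]
result = overlap_metrics(group)
print(result["proportion_sharing"], result["min_shared"])  # 1.0 1
```

In this group every pair shares at least one item, the weakest pair shares one item, and the average over the three pairs is 4/3.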

For our evaluation, we decided to form groups at the start of each module to capture different moments in time during the school year. We acknowledge that in a real situation, it is reasonable to suspect that groups formed after the first module would likely be influenced by the prior groupings, due to the influence of human interaction. Nonetheless, the purpose of this study is simply to evaluate mock groupings at different points in time for which a teacher could have applied the devised method. In practice, teachers could form groups at any point in time, but the start of a module seems like a sensible choice. The class we analyzed for this method (Class 1) had a total of 15 modules during the school year. We removed the first module from analysis, as it is reasonable for teachers to acclimate students with the program at the beginning of the year. We also removed three review modules that had 3, 5, and 10 items, respectively, resulting in a total of 11 modules for which we compared the different grouping methods.

Unlike the within-module grouping method, which employs a k-means algorithm that may produce different size clusters, we fixed the number of students in each group for both the random and score methods. The class used in this analysis (Class 1) had 27 students, so we fixed the size of each group to be 5 to establish reasonable size groupings. For evaluation purposes, the random method operates as follows. We generated all possible groups of size 5 from the available class roster at the start of each module. For most modules this would mean that there were \(\left(\begin{array}{c}27\\ 5\end{array}\right)=\mathrm{80,730}\) possible groups; however, some students enrolled late, giving a class roster ranging from 24 to 27 students depending on the module. The averages for the evaluation metrics (1) – (3) were then computed, giving an overall expectation of these measures for a randomly selected group of size 5. For the score method, on each module, we partitioned the class by score, giving a deterministic set of groups on each module. This was done by first grouping students with the top 5 scores, then grouping students with the next 5 highest scores until each student had been placed in a group. If 1 student remained, the student was placed in the final group. If 2–4 students remained, they formed a new group. Table 1 shows the results of (1) – (3) on each group for the within-module grouping method as well as the averages of these measures across all groups for each of the 11 modules. The final two columns show the averages of (1) – (3) for the random method and score method.
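The score method’s partition rule, together with the group count quoted above, can be sketched as follows. The roster and function name are hypothetical, and the handling of tied scores (ties keep their roster order) is an assumption, since the text does not specify it.

```python
from math import comb


def partition_by_score(scores, size=5):
    """Partition students into groups of `size` by descending score.

    `scores` maps each student to the number of items in their state.
    A single leftover student joins the final group; two or more
    leftover students form a new, smaller group.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    groups = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    if len(groups) > 1 and len(groups[-1]) == 1:
        leftover = groups.pop()
        groups[-1].extend(leftover)
    return groups


print(comb(27, 5))  # 80730 possible groups of size 5 from 27 students

# Hypothetical roster of 27 students with distinct scores.
roster = {f"s{i:02d}": 100 - i for i in range(27)}
print([len(g) for g in partition_by_score(roster)])  # [5, 5, 5, 5, 5, 2]
```

With a roster of 26 students, the same rule yields four groups of 5 and a final group of 6.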

Table 1 The three methods and measures of (1) – (3) on each module

Lastly, to provide a relative comparison among the methods, we used the distribution of all possible groups of size 5 as a baseline. For the within-module and score methods, we include the percentiles of the averages of (2) and (3) with respect to the distribution of all possible groups of size 5. Though the groups formed by the within-module and score methods do not come from the same population as our baseline distribution, these percentiles provide us with a proxy of how the average measures of (2) and (3) for the within-module and score methods compare to a randomly chosen group of size 5. These percentiles are given in parentheses in Table 1.

Results

In Table 1, our results show that the within-module grouping method obtained more favorable measures than the random and score methods on each module on average. The within-module grouping method produced greater (or equal) averages than the random method on metrics (1) – (3) for each module. The same was almost true when compared to the score method. The score method obtained equal or greater measurements on only two occasions (in bold). We also observe that the percentiles for the averages of (2) and (3) for the within-module grouping method were almost always above the 90th percentile among the distribution of all possible groups of size 5, and above the 95th percentile half of the time. The score method achieved at least an 80th percentile the majority of the time. We also observe that the within-module grouping method obtained measures that could be considered ideal over most of the groups formed across all 11 modules. Of the 55 groups created, there were only 6 groups formed (indicated by the shaded cells) where there existed a pair of students that did not share any item in their outer fringe with respect to the module.

Another result observed was that students in groups formed by the within-module grouping method tended to have similar scores, although it was common for more than one group to have roughly the same distribution of scores. Despite this, throughout every module there tended to be a handful of students who would appear to be out of place when just considering scores. For example, in “Module 3 Exponents,” group 3 had nine students. One of these students had an outlier score of 41, while the other eight scores ranged from 64 to 81 with an average of 70.4. The student with score 41 had 7 items in their outer fringe in the module. This is striking considering among the other eight students, the average number of items in the outer fringe in the module was 7.6. Moreover, the student with the highest score of 81 had exactly 7 items in their outer fringe in the module. In other words, despite having an outlier score well below groupmates, the student with score 41 was at the group’s level of readiness with respect to the module. It is worth noting, however, that the student with a score of 41 had 5 prerequisite items to complete before completing the entire module. The other eight students had anywhere from 0 to 3 prerequisites to complete.

This example highlights two takeaways. First, the within-module grouping method shows the capability of handling a nuanced situation by grouping a student who may have a lower overall ability, but who is ready to learn with groupmates on a specific set of items. Second, because students may display different overall ability and may need to complete additional prerequisites, teachers may want to regroup students more frequently. This suggests that teachers may want to take advantage of the flexible grouping nature of this method by regrouping mid-module to check if grouped students are moving through the module at roughly the same rate. This would be especially useful if modules are scheduled to last several weeks.

Lastly, we applied an internal validation measure to evaluate the goodness of the clusters formed in the analysis above. For this, we applied a silhouette analysis, which measures how well observations are clustered with other observations similar to themselves (Rousseeuw, 1987). For each clustering experiment (i.e., on each module), we computed silhouette coefficients for each student, which range from –1 (indicating that the student was incorrectly clustered) to 1 (indicating that the student was appropriately clustered). We then computed the average silhouette coefficients for each clustering experiment and obtained a distribution of averages. This distribution ranged from 0.20 to 0.55 with a mean of 0.34 and standard deviation of 0.10. These results indicate that on average, students clustered by the within-module grouping method had higher similarity to other students in their cluster than to students in any other cluster formed.
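For readers unfamiliar with the measure, the silhouette coefficient of a single observation can be sketched as below. The use of Euclidean distance is an assumption, as the text does not state which distance was used, and the points and labels are hypothetical. The sketch requires at least two clusters.

```python
from math import dist


def silhouette(i, points, labels):
    """Silhouette coefficient of observation i (Rousseeuw, 1987)."""
    own = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    if not own:
        return 0.0  # convention for a cluster containing a single observation
    # a: mean distance to the other members of the observation's own cluster
    a = sum(dist(points[i], points[j]) for j in own) / len(own)
    # b: smallest mean distance to the members of any other cluster
    b = min(
        sum(dist(points[i], points[j]) for j in members) / len(members)
        for members in (
            [j for j in range(len(points)) if labels[j] == c]
            for c in set(labels) - {labels[i]}
        )
    )
    return 0.0 if max(a, b) == 0 else (b - a) / max(a, b)


points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette(0, points, [0, 0, 1, 1])  # clustered with its near neighbor
bad = silhouette(0, points, [0, 1, 0, 1])   # clustered with a distant point
print(good > 0, bad < 0)  # True True
```

Averaging these coefficients over all students in a clustering gives the per-module averages reported above.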

Insights

The within-module grouping method uses k-means on student feature vectors that contain information on a student’s knowledge over a large set of items. By doing so, we took a fine-grained multidimensional view of student ability to form homogeneous groups. As a result of taking this view, we saw that students may be grouped with other students with considerably different scores, and with good reason as they were ready to learn along with the group on roughly the same set of items. If these results generalize, this would seem to suggest the need for rethinking the widely applied (as well as criticized) practice of tracking students into ability groups for course taking in schools. Critics of tracking have pointed out several problematic issues as the result of the practice with regard to social and racial discrimination (Cipriano-Walter, 2015; Gallardo, 1994), teacher appointment (Finley, 1984; Kelly, 2004), and curriculum expectations (Oakes, 2005). Perhaps if current notions of low and high ability are adjusted towards a more fine-grained view, student knowledge can still be taken into consideration when grouping, but in a way that recognizes students’ individual strengths, particularly for those who may be perceived as low ability.

Curriculum-Wide Grouping Method

Method Specifications

This grouping method is designed for ALEKS classes with no modules, which is the most typical set-up among all K-12 classes in ALEKS. Classes with no modules are meant to allow students to work at their own pace on different parts of the curriculum. This method is designed to group students of similar levels of readiness with respect to an appropriate set of content. Therefore, the goal of this method is not only to group students of similar readiness, but it also aims to identify the appropriate content for the group to focus on (e.g., for teacher-led small group instruction). Note that this goal poses a significantly different task from the within-module grouping method, which already has a pre-defined module (or piece of content) from which student readiness is determined. In order to identify an appropriate set of content, we leveraged the ALEKS course table of contents, which is organized into slices and subslices. Slices and subslices can be thought of as modules, particularly because of how the items are organized and sequenced in a logical order as a teacher might do with modules. The specifications for the curriculum-wide grouping method are described as follows.

The only parameter that this algorithm takes is the size of each group, \(n\). If there are \(N\) total students in the class, there are \(\left(\begin{array}{c}N\\ n\end{array}\right)\) possible groups that can be made. For every possible group, we perform the following procedure. For a given point in time, we create a matrix where each entry gives the number of items in a student’s outer fringe corresponding to a particular subslice in the ALEKS course. The rows represent every student in the group (size \(n\)), and the columns represent every subslice in the course table of contents (size \(s\)). Then, for every column (subslice), we find the minimum value. This essentially gives a worst-case scenario on every subslice in terms of how many items the group is ready to learn from the subslice. In other words, if this number is \(t\) for a given subslice, we can say that overall, the group has at least \(t\) items ready to learn on the given subslice. Next, we want to identify the “best” subslice from these worst-case scenarios by finding the maximum across all columns of these minimums. Suppose \({t}_{1}, {t}_{2}, {t}_{3}\dots ,{t}_{s}\) are the minimum number of items in the outer fringes for the group in each subslice. The maximum of these minimums may be simultaneously achieved by multiple subslices. Therefore, we may get \(\mathrm{max}\left({t}_{1}, {t}_{2}, {t}_{3}\dots ,{t}_{s}\right)={t}_{i}=\dots ={t}_{j}\). For such cases, we use a tie breaker so that we only identify one subslice. To do this, we rank each subslice in the course according to how many postrequisites the subslice has. The rationale for this is that we want to identify the most critical subslice for future success. For example, a subslice on scientific notation is mostly self-contained in the curriculum, whereas order of operations is critical for many other concepts in the curriculum (e.g., evaluating and simplifying algebraic expressions, solving equations and inequalities, etc.).
Therefore, suppose that the subslice with index \(i\) has the highest rank. In this situation, we would pick the subslice with index \(i\) for the group. Let \({t}_{1}^{*}={t}_{i},\) which gives the number of items in the outer fringe corresponding to the subslice chosen for the 1st group out of all \(\left(\begin{array}{c}N\\ n\end{array}\right)\) possible groups.
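The per-group step described above (column minimums, then the maximum of those minimums with a postrequisite tie-breaker) can be sketched as follows. The matrix values, the postrequisite counts, and the function name are hypothetical. If tied subslices also have equal postrequisite counts, this sketch keeps the first one, which is an assumption not specified in the text.

```python
def pick_subslice(counts, postreqs):
    """Choose a subslice for one group.

    `counts` has one row per student; each row gives the number of
    outer-fringe items the student has in each subslice. `postreqs[j]`
    is the number of postrequisites of subslice j (the tie-breaker).
    Returns (index of chosen subslice, t*).
    """
    column_mins = [min(col) for col in zip(*counts)]  # worst case per subslice
    best = max(column_mins)                           # best worst case
    tied = [j for j, t in enumerate(column_mins) if t == best]
    chosen = max(tied, key=lambda j: postreqs[j])
    return chosen, best


# Hypothetical group of 3 students and 4 subslices.
counts = [
    [3, 0, 2, 4],
    [2, 1, 2, 0],
    [5, 0, 3, 1],
]
postreqs = [1, 0, 7, 2]
print(pick_subslice(counts, postreqs))  # (2, 2)
```

Subslices 0 and 2 both guarantee two ready-to-learn items for every student, and subslice 2 is chosen because it has more postrequisites.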

The process described above is performed on each of the \(\left(\begin{array}{c}N\\ n\end{array}\right)\) possible groups. This will produce the values \({t}_{1}^{*}, {t}_{2}^{*}, \dots ,{t}_{\left(\begin{array}{c}N\\ n\end{array}\right)}^{*}\), each identifying a subslice for a particular group of \(n\) students. We now need to determine the actual set of suggested groupings from this set of all possible groups. To pick our first group of size \(n\), we find \(\mathrm{max}\left({t}_{1}^{*}, {t}_{2}^{*}, \dots ,{t}_{\left(\begin{array}{c}N\\ n\end{array}\right)}^{*}\right)\). As before, this could correspond to multiple subslices, in which case we again identify the subslice with the most postrequisites. Once the first group of size \(n\) is selected, we repeat the process on the remaining \(\left(\begin{array}{c}N-n\\ n\end{array}\right)\) possible groups. We continue with this process until all students are placed in a group. If \(n\) does not divide \(N\), the leftover students are handled as follows: if \(N\div n\) leaves a remainder of 1, the last group will consist of \(n+1\) students; if the remainder is 2 or greater, the remaining students will be placed in a new group of size less than \(n\).
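The greedy selection over all possible groups can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the student labels, counts, and function names are hypothetical, ties between groups that remain after the postrequisite comparison are resolved by enumeration order (an assumption), and the exhaustive enumeration is only practical for classroom-sized rosters.

```python
from itertools import combinations


def best_subslice(group, counts, postreqs):
    """Return (t*, postrequisite count, subslice index) for one group."""
    mins = [min(counts[s][j] for s in group) for j in range(len(postreqs))]
    return max((mins[j], postreqs[j], j) for j in range(len(postreqs)))


def form_groups(counts, postreqs, n):
    """Greedily pick the group with the largest t*, then repeat on the rest."""
    remaining = sorted(counts)
    groups = []
    while len(remaining) >= n:
        group = max(
            combinations(remaining, n),
            key=lambda g: best_subslice(g, counts, postreqs)[:2],
        )
        groups.append(list(group))
        remaining = [s for s in remaining if s not in group]
    if len(remaining) == 1 and groups:
        groups[-1].extend(remaining)  # remainder of 1: last group has n + 1
    elif remaining:
        groups.append(remaining)      # remainder of 2 or more: new small group
    return groups


# Hypothetical class: outer-fringe counts per student over two subslices.
counts = {"s1": [4, 0], "s2": [4, 1], "s3": [0, 3], "s4": [1, 3], "s5": [2, 2]}
postreqs = [5, 1]
print(form_groups(counts, postreqs, n=2))
# [['s1', 's2'], ['s3', 's4', 's5']]
```

The pair s1 and s2 is selected first because it guarantees four ready-to-learn items in the first subslice; s3 and s4 follow with three items in the second subslice, and the single leftover student joins that last group.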

Evaluation

We evaluated the curriculum-wide grouping method against two alternative methods: one that grouped students randomly and another that grouped students by score. To do this, we used metrics that measure the number of items in the outer fringe of the students in each group with respect to the identified subslice (as detailed in Section Method Specifications), as well as the degree of overlap of outer fringes among the students in the group. These metrics are described as follows:

  (1) Minimum number of items in group’s outer fringes: This is the minimum number of items ready to learn in the identified subslice for the group. This is the value that is maximized in Section Method Specifications. In a real situation, it makes sense to want this value to be as large as possible since it gives the teacher the most flexibility to create a lesson plan related to the subslice.

  (2) Minimum number of items shared: This computes the number of shared items in the outer fringe (with respect to the subslice) for each possible pair in the group and takes the minimum of all such values. Intuitively, this identifies the “weakest” pair-link in the group. Ideally in practice, we want this value to be as large as possible.

  (3) Average number of items shared: This computes the number of shared items in the outer fringe (with respect to the subslice) for each possible pair in the group and takes the arithmetic average of all such values. Ideally in practice, we want this as large as possible.

As we did with the curriculum-wide grouping method, “good” subslices are identified for each group formed randomly or by score, so that these methods would have a fair chance of scoring well on the aforementioned metrics. The process for identifying this “good” subslice is the same process outlined in Section Method Specifications. For this analysis we used Class 2, which consists of 30 students. For each method, the group size is \(n=6\) for a total of 5 groups. Also, because there are no modules with due dates, we defined 11 checkpoints that are evenly spaced throughout the year to mirror the number of modules observed in Section Evaluation, serving as moments in time a teacher could have formed groups. In addition to the metrics above, we also computed the averages of (1) – (3) across all groups formed for each checkpoint. We also computed the percentiles of the averages of (1) – (3) for the curriculum-wide and score methods with respect to the distribution of all possible groups of size 6. Note that the groups formed using the curriculum-wide and score methods are indeed members of the distribution of all possible groups of size 6, which makes the percentiles an apt measure for comparisons.

Finally, it is worth noting that we initially performed our analysis by identifying the subslice for each group from among all possible subslices covered in the class. We noticed, however, that for many of the checkpoints this led to multiple groups (as many as four) being assigned to the subslice “Geometry”, which is a review subslice on concepts related to perimeter, area, volume, and surface area. This means that students had a relatively large number of geometry items in their outer fringe, which they likely did not work on for a while as the course progressed. Since these skills are likely not to be a focus of an Algebra 1 class, we removed “Geometry” from the possible subslices when forming our groups for each of the three methods. We note that the quantitative evaluation results are quite similar with and without the inclusion of the Geometry subslice. More broadly, this demonstrates how our method can easily be customized, allowing teachers to specify a subset of subslices for the method to choose from, which is likely a feature teachers would want in practice.

Results

The results in Table 2 show that the curriculum-wide grouping method obtained more favorable measures than both the random method and score method on almost all checkpoints and metrics. For average measures on (1) – (3), there were only three occurrences in which the score method produced more favorable values than the curriculum-wide grouping method, all in the first two checkpoints (marked in bold in Table 2). Of the 55 groups formed by the curriculum-wide grouping method, only four (indicated by the shaded cells) had a pair of students who did not share an item in the chosen subslice. We also observe that the percentiles for the averages of (1) – (3) from the curriculum-wide grouping method were nearly always above the 80th percentile and above the 90th percentile a little over two-thirds of the time. Considering that the curriculum-wide grouping method is designed to form groups such that the minimum number of items ready to learn in a particular subslice is maximized, it is not surprising that the results for metric (1) were better than those for the other metrics. Specifically, the percentile for the average of (1) is always above the 90th percentile and nearly always above the 95th percentile. In contrast, the score method never achieves a percentile above the 90th percentile.

Table 2 The three methods and measures of (1) – (3) on each checkpoint

The identified subslices throughout the course are listed in Table 3. Qualitatively, we observe that our method identified a variety of subslices for groups to focus on both within a given time slice and across time. That is, our method seems to identify that (a) different topic areas are needed for different groups of students at any given point in time, and (b) the content that students should focus on changes over time as students progress in their learning throughout the course. This suggests that making content recommendations is just as critical as forming the groups themselves.

Table 3 Identified subslices on each checkpoint

Insights

The curriculum-wide grouping method tended to group students with a wide range of overall scores. This may be counterintuitive considering our method is designed to form homogeneous groups in terms of their readiness on certain items. Specifically, 48 of the 55 groups (87%) had a mixture of students who were below the class median in overall score and students who were above it. This is in sharp contrast to the score method, which, by design, has only one group in each checkpoint (20% overall) with a mixture of students falling below and above the class median in overall score. Moreover, this one group contains precisely the students with scores in the 40th-60th percentiles, which is likely to have a smaller range in overall score than most of the groups formed by the curriculum-wide grouping method. These findings highlight a subtle, yet interesting phenomenon. While the curriculum-wide grouping method is designed to form homogeneous groups, where students in a group are ready to learn the same items, it also simultaneously forms heterogeneous groups with respect to overall score, which takes a much coarser view of ability. Therefore, depending on the particular view of ability (whether we consider ability as a multidimensional measure of readiness on a variety of items or as overall score), the groups formed by the curriculum-wide grouping method possess qualities of both homogeneity and heterogeneity.
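The "mixed group" check behind these counts can be illustrated with a short sketch. The scores below are made up, and the score method is assumed here to sort students by overall score and cut the ranking into consecutive groups of six; under that assumption only the middle group straddles the class median.

```python
from statistics import median

def is_mixed(group, score, class_median):
    """True if the group has at least one member below and one above the class median."""
    below = any(score[s] < class_median for s in group)
    above = any(score[s] > class_median for s in group)
    return below and above

# Made-up overall scores for a class of 30 (student k has score k).
score = {f"s{k}": k for k in range(1, 31)}
class_median = median(score.values())

# Assumed form of the score method: sort by score, cut into consecutive groups of 6.
ranked = sorted(score, key=score.get)
score_groups = [ranked[k:k + 6] for k in range(0, len(ranked), 6)]

mixed_flags = [is_mixed(g, score, class_median) for g in score_groups]
print(mixed_flags)
```

The same `is_mixed` check could be applied to groups produced by any other grouping method to obtain the share of mixed groups.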

Reciprocal Pairing Method

Method Specifications

This method is designed to elicit more fruitful interactions between student pairs as they work collaboratively in a Reciprocal Peer Tutoring (RPT) format. In this format, students in a pair alternate between tutor and tutee roles. For learning to operate optimally in this format, it would be best to find pairs of students who have complementary knowledge so that each student in a pair can contribute meaningfully in a peer-to-peer learning exchange. In the context of ALEKS, a pair of students has complementary knowledge when each student has something in their state that the other does not. The goal of this pairing method is to find pockets of content (e.g., subslices) for which the tutor is knowledgeable and the tutee is ready to learn so that the exchange occurs in the tutee’s Zone of Proximal Development. Before describing the reciprocal pairing method, which will involve the course subslices, let us first describe a systematic process that finds complementary pairs without considering subslices. Although we do not use this process directly, it helps in describing our reciprocal pairing method. The process is described in three steps (i) – (iii) as follows.

  1. (i)

    For a particular student \(\alpha\), let \({K}_{\alpha }\) and \({F}_{\alpha }\) denote the student’s state and outer fringe, respectively. For every possible pair of students \(i\) and \(j\) in the class, we want to consider the number of items in the intersection of student \(i\)’s outer fringe and student \(j\)’s state, \(\left|{F}_{i}\cap {K}_{j}\right|\). Likewise, we want to consider the number of items in the intersection of student \(i\)’s state and student \(j\)’s outer fringe, \(\left|{K}_{i}\cap {F}_{j}\right|\). Conceptually, \(\left|{F}_{i}\cap {K}_{j}\right|\) is the number of items student \(j\) can help student \(i\) learn, and \(\left|{K}_{i}\cap {F}_{j}\right|\) is the number of items student \(i\) can help student \(j\) learn. These two numbers are stored in an \(N\times N\) matrix, where \(N\) is the number of students in the class. Figure 3 shows an illustration of this matrix.

  2. (ii)

    Next, for each student \(i\) (or row), we find the smaller of the two values for every tuple (shown in Fig. 4). By considering the smaller value of each tuple, we are essentially finding the minimum number of items for which the pair of students can engage in an RPT exchange. Ideally, we want this number to be as large as possible when considering all tuples in a given row. Therefore, we rank these minimum values in descending order, resulting in a ranking from 1 to \(N\). A rank of 1 represents the largest minimum among all tuples in a row and a rank of \(N\) represents the smallest minimum among all tuples in a row. Note we do not want to consider the larger value or the average of the two values of each tuple, because then we would lose sight of the other number in the tuple, which could be 0 (or something small). We show an example of this scenario in Fig. 4 with the student pair \(\{i, 1\}\). In this example, while there are 36 items that student 1 can teach student \(i\), there are 0 items student \(i\) can teach student 1. Because this kind of pairing is likely to result in a one-sided exchange, our aim is to avoid this situation as best as we can.

  3. (iii)

    We use the rankings to greedily pair each student in the class. We do this by first randomly choosing a student from the class, say student \(i\). Then, we pair student \(i\) with the student who gives the highest rank. In Fig. 4, this refers to the tuple (19, 16), which forms the pair \(\left\{i,j\right\}\). We then select another student at random (excluding students \(i\) and \(j\)) and continue to find a partner using the highest rank. This method applies a greedy approach since we are not entirely in control of the quality of the pairing in each successive pair made. For example, student \(j\) could have been top ranked for multiple students in the class. However, since student \(j\) is paired early (with student \(i\)), student \(j\) is no longer available to be partnered with anyone else and therefore the next best ranked student must be taken. A drawback of using a greedy approach is that we can still produce a tuple containing 0 since we are greedily eliminating top ranked partners. Note, we elect to apply a greedy algorithm as opposed to finding the optimal pairing for the entire class since this kind of optimization problem could be computationally expensive even if there existed an elegant solution. To provide some perspective, if this were done in a brute force manner, there would be over a nonillion (\(10^{30}\)) possible pair configurations for a class of just 20 students!
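Steps (i) – (iii) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the names `tuple_matrix` and `greedy_pairs` are hypothetical, the states and outer fringes are made-up sets of item identifiers, and ties between equally ranked partners are broken by student label.

```python
import random

def tuple_matrix(state, fringe):
    """Step (i): for each ordered pair (i, j), store (|F_i & K_j|, |K_i & F_j|)."""
    students = sorted(state)
    return {
        (i, j): (len(fringe[i] & state[j]), len(state[i] & fringe[j]))
        for i in students for j in students if i != j
    }

def greedy_pairs(tuples, students, rng):
    """Steps (ii)-(iii): pick a random student, then the still-available
    partner whose tuple has the largest minimum value."""
    unpaired = sorted(students)
    pairs = []
    while len(unpaired) >= 2:
        i = rng.choice(unpaired)
        unpaired.remove(i)
        j = max(unpaired, key=lambda c: min(tuples[(i, c)]))
        unpaired.remove(j)
        pairs.append((i, j))
    return pairs, unpaired  # `unpaired` holds the leftover student if the class is odd

# Made-up knowledge states K and outer fringes F (item identifiers are integers).
state = {"A": {1, 2, 3}, "B": {4, 5, 6}, "C": {1}, "D": {7}}
fringe = {"A": {4, 5}, "B": {1, 2}, "C": {2, 3}, "D": {8}}

tuples = tuple_matrix(state, fringe)
pairs, leftover = greedy_pairs(tuples, state, random.Random(0))
print(tuples[("A", "B")], tuples[("A", "C")])
print(pairs, leftover)
```

In this toy class, the pair {A, B} has the tuple (2, 2), whereas {A, C} has (0, 2), the one-sided situation the ranking by minimum value is meant to avoid. Which pairs the greedy pass returns depends on the random starting student.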

Fig. 3

An illustration of a matrix of tuples containing the number of items shared in the state and outer fringe for every possible student pair in the class. a Rows and columns indicate the ith and jth student. b The tuple contains the number of items in the intersection of student i’s outer fringe and student j’s state, and the number of items in the intersection of student i’s state and student j’s outer fringe

Fig. 4

An illustration showing an example of ranked tuples for student \(i\). a Minimum values in blue indicate the minimum number of items for which the pair can engage in an RPT session. b The highest ranked tuple indicates the “best” student pair to engage in an RPT session

So far in describing the process above, we have not considered the actual items in the sets \({F}_{i}\cap {K}_{j}\) and \({K}_{i}\cap {F}_{j}\). Indeed, the items can be dispersed throughout the curriculum. Therefore, simply pairing students in the manner described above is not advised as the lack of content focus could lead to poor interactions. This leads us to the other goal of our reciprocal pairing method, which is to identify two subslices whereby one student is more experienced than the other. We describe the specifications of the reciprocal pairing method in three steps, (a) – (c).

  1. (a)

    As before, we compute the tuple \(\left(\left|{F}_{i}\cap {K}_{j}\right| , \left|{K}_{i}\cap {F}_{j}\right|\right)\) for every possible student pair \(\left\{i,j\right\}\), but this time we do so on every subslice. That is, we find the tuples

    $$\left(\left|{F}_{i1}\cap {K}_{j1}\right| , \left|{K}_{i1}\cap {F}_{j1}\right| \right), \left( \left|{F}_{i2}\cap {K}_{j2}\right| , \left|{K}_{i2}\cap {F}_{j2}\right| \right), \dots , \left( \left|{F}_{is}\cap {K}_{js}\right| , \left|{K}_{is}\cap {F}_{js}\right|\right),$$

    where \(s\) is the number of subslices in the ALEKS course table of contents, \({F}_{\alpha k}\) is student \(\alpha\)’s outer fringe intersected with subslice having index \(k\), and \({K}_{\alpha k}\) is student \(\alpha\)’s state intersected with subslice having index \(k\).

  2. (b)

    Then, for each \(\left\{i,j\right\}\) pair, we find the largest value for both the first and second elements in the tuple over all \(s\) tuples in the \({ij}^{th}\) position. These values are given by \(\underset{k}{\mathrm{max}}\left\{\left|{F}_{ik}\cap {K}_{jk}\right|\right\}\) and \(\underset{k}{\mathrm{max}}\left\{\left|{K}_{ik}\cap {F}_{jk}\right|\right\}\), respectively. These values are then used to form another tuple that gets stored in another \(N\times N\) matrix.

  3. (c)

    From this matrix, we then proceed to pair the students as described in steps (ii) – (iii). If there is an odd number of students, the last student is paired with the student with the highest rank, as described in (iii), forming a group of three.
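Steps (a) and (b) can be sketched as follows for a single pair of students. The function name `best_subslice_tuple`, the subslice labels "X" and "Y", and the item sets are hypothetical and chosen only for illustration; the sketch relies on the identity \({F}_{ik}\cap {K}_{jk} = {F}_{i}\cap {K}_{j}\cap {S}_{k}\), where \({S}_{k}\) denotes the set of items in subslice \(k\).

```python
def best_subslice_tuple(i, j, state, fringe, subslices):
    """Steps (a)-(b): per-subslice counts, then the maximum of each tuple
    element over all subslices, together with the subslices attaining them."""
    # Items j can help i learn, counted per subslice: |F_ik & K_jk|
    learn = {k: len(fringe[i] & state[j] & items) for k, items in subslices.items()}
    # Items i can help j learn, counted per subslice: |K_ik & F_jk|
    teach = {k: len(state[i] & fringe[j] & items) for k, items in subslices.items()}
    k_learn = max(learn, key=learn.get)
    k_teach = max(teach, key=teach.get)
    return (learn[k_learn], teach[k_teach]), (k_learn, k_teach)

# Made-up course with two subslices and two students.
subslices = {"X": {1, 2, 3, 4}, "Y": {5, 6, 7, 8}}
state = {"i": {1, 5, 6, 7}, "j": {2, 3, 4, 5}}
fringe = {"i": {2, 3, 8}, "j": {1, 6, 7}}

result = best_subslice_tuple("i", "j", state, fringe, subslices)
print(result)
```

Here student j can help student i with two items in subslice X, and student i can help student j with two items in subslice Y, so the pair would be stored with the tuple (2, 2) and the subslices (X, Y). Without the subslice restriction the tuple would have been (2, 3), with the three items spread over both subslices. Applying this function to every pair fills the \(N\times N\) matrix used in step (c).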

Evaluation

We evaluated the reciprocal pairing method against two alternative methods: one that pairs students randomly (random method) and another that pairs students by closest score (score method). The score method applied a greedy approach whereby we started by selecting one student at random from the class and paired this student with the student with the closest score. We then repeated this process until we paired every student in the class. An odd number of students is handled in the same way as in the reciprocal pairing method. To evaluate all methods on a level playing field, we paired students on the “best” subslices as determined by (a) – (c) in Section Method Specifications. Also, because the pairs formed in all three methods depend heavily on random selections of the first student in the pair, we evaluated over 1,000 runs on each method. We then computed measurements over all runs to make our comparisons. The metrics we used are given in Table 4. We performed our analysis at three different timestamps: one at roughly the beginning of the class (2018-10-07), another roughly midway (2019-01-16), and one near the end of the class (2019-03-24). These results are shown in Table 5.

Table 4 Metrics for comparing the three pairing methods: reciprocal, random, and score
Table 5 Measures involving the three pairing methods over 1,000 runs at three different timestamps

Results

From Table 5, we observe that the reciprocal pairing method achieves more favorable measurements on almost all metrics across the three timestamps. More importantly, this method rarely produced tuples with a 0. In practice, this is important because we never want to generate a tuple pair with a 0 since this would violate the RPT framework where each student should have something to impart to their partner. Moreover, the reciprocal pairing method always achieves a value of 2 for the largest minimum tuple value over all runs. In practice, this may arguably be the most important metric and result, because it indicates that we can find a set of pairings such that each peer “tutor” has knowledge of at least two items in a given subslice. On the other hand, we cannot be certain that the score method can achieve a minimum value of 2 in the first two timestamps, as we were unable to achieve this over 1,000 runs. We do not claim that the score method cannot achieve a largest minimum of 2 or greater in the first two timestamps; however, if it is possible, it may require many more runs (i.e., more computational time). In practice, one could imagine a pairing algorithm tool that executes \(R\) number of runs in the background and selects the “best” configuration for the teacher. One way to define “best” is by picking a run that has the largest minimum tuple value in order to ensure all students can contribute meaningfully.
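The idea of a tool that executes \(R\) runs and keeps the “best” configuration can be sketched as follows. This is an illustrative sketch only: `best_of_runs` and `greedy_pairs` are hypothetical names, the minimum tuple values in `min_value` are made up, “best” is taken to mean the largest minimum tuple value over the pairs of a run, and an even number of students is assumed.

```python
import random

def greedy_pairs(min_value, students, rng):
    """Greedy pairing: random student, then the available partner with the largest minimum."""
    unpaired = sorted(students)
    pairs = []
    while len(unpaired) >= 2:
        i = rng.choice(unpaired)
        unpaired.remove(i)
        j = max(unpaired, key=lambda c: min_value[frozenset((i, c))])
        unpaired.remove(j)
        pairs.append((i, j))
    return pairs

def best_of_runs(min_value, students, runs, seed=0):
    """Repeat the greedy pairing and keep the run whose worst pair is best."""
    rng = random.Random(seed)
    best_pairs, best_score = None, -1
    for _ in range(runs):
        pairs = greedy_pairs(min_value, students, rng)
        score = min(min_value[frozenset(p)] for p in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs, best_score

# Made-up minimum tuple values for every pair in a class of four.
students = ["A", "B", "C", "D"]
min_value = {
    frozenset(("A", "B")): 3,
    frozenset(("A", "C")): 2,
    frozenset(("B", "D")): 2,
    frozenset(("A", "D")): 0,
    frozenset(("B", "C")): 0,
    frozenset(("C", "D")): 0,
}
best_pairs, best_score = best_of_runs(min_value, students, runs=200)
print(best_pairs, best_score)
```

In this toy class, a run that starts with student A or B pairs them together and leaves C and D with a tuple containing 0, whereas a run that starts with C or D yields the configuration {A, C}, {B, D} with a largest minimum of 2. Repeating the greedy pass is what allows the better configuration to be found.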

Although the reciprocal pairing method achieved more favorable measures than the score method across the different metrics, the measures for the score method do appear to be somewhat comparable to those of the reciprocal pairing method. This may not be particularly surprising considering one might expect students to need roughly the same overall level of ability to be able to tutor each other. The score method may be good enough in practice; however, recall that the algorithm still needs to determine appropriate subslices for the students to engage in an RPT session, which all of our pairing methods do. Therefore, even if a teacher were to use our score method, it would still be more sophisticated than traditional methods for RPT in that the score method identifies appropriate content for reciprocal tutoring. Lastly, earlier in Section Peer Tutoring, we had noted that a common practice for studying RPT is to pair students randomly. We found that the reciprocal pairing method was able to find pairs containing a higher number of items in their complementary knowledge than the random method. The average tuple value was greater for the reciprocal method than for the random method at each timestamp, with the difference between the two falling in a 95% confidence interval (over the 1,000 runs) of [0.275, 0.475], [0.305, 0.527], and [0.315, 0.531], respectively.

While our method identified items such that for each pair each student knows something that the other does not, this does not necessarily mean that each student is relatively more knowledgeable than their peer on the associated subslice. For example, Student A might have learned a particular item in the subslice “Graphs of Functions” that Student B has not yet learned, but Student B might generally know more about “Graphs of Functions” than Student A. In that case, it might not make much sense to have Student A tutor Student B on “Graphs of Functions”. To see if this was the case, we examined one specific pairing configuration. Table 6 shows what a pairing configuration might have looked like on March 24, 2019, using the reciprocal pairing method. In the first row, first column, (11, 17) indicates that students 11 and 17 were paired. Under the column “State and Fringe Tuples on Subslice”, we see the tuple (3, 5), and under the column “Subslice (X; Y) Titles”, we see (Polynomial Multiplication; Graphs of Functions). This means that student 17 can potentially teach 3 items from the subslice on Polynomial Multiplication to student 11, and student 11 can potentially teach 5 items from the subslice on Graphs of Functions to student 17. Therefore, we may expect that student 17 has more knowledge overall on the subslice of Polynomial Multiplication and that student 11 has more knowledge overall on the subslice of Graphs of Functions. To check if this is the case, we provide the columns titled “Number of items in state of (A, B) on Subslice X”, which has the tuple (4, 7) in the first row, and “Number of items in state of (A, B) on Subslice Y”, which has the tuple (9, 2) in the first row. The tuple (4, 7) indicates that student 11 has 4 items on Polynomial Multiplication in their state and student 17 has 7 items on Polynomial Multiplication in their state. Therefore, we confirm that student 17 has more knowledge overall on Polynomial Multiplication.
Likewise, the tuple (9, 2) indicates that student 11 has 9 items on Graphs of Functions in their state and student 17 has 2 items on Graphs of Functions in their state. So, we confirm that student 11 has more knowledge overall on Graphs of Functions. If we look down these two columns, we notice that this trend holds for every pair except the student pair (25, 10), where the two students share the same number of items in the subslice on One-Step Linear Equations even though student 25 is identified as a tutor for that subslice. Even then, student 25 does not know fewer items in the subslice than student 10. We examined several other configurations and noticed the same trend holds in just about every pair. Indeed, we noticed it was relatively easy to find a configuration where this property held for all pairs. Therefore, we argue that the reciprocal pairing method is not only a tool to support RPT, but it may also serve as a tool for teachers to quickly identify unique strengths of students and where these strengths can be leveraged to help fellow classmates.
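The check described above amounts to a simple comparison of state counts, sketched below. The function name `tutor_knows_more` is hypothetical; the first call uses the values reported for students 11 and 17, while the second call uses made-up counts to illustrate a tie.

```python
def tutor_knows_more(counts_x, counts_y):
    """counts_x = (A's, B's) state items on subslice X, where B tutors A;
    counts_y = (A's, B's) state items on subslice Y, where A tutors B.
    Returns True only if each tutor knows strictly more items on their subslice."""
    a_x, b_x = counts_x
    a_y, b_y = counts_y
    return b_x > a_x and a_y > b_y

# Students (11, 17): (4, 7) on Polynomial Multiplication, (9, 2) on Graphs of Functions.
first_row = tutor_knows_more((4, 7), (9, 2))

# Made-up tie on subslice X: the property does not hold strictly.
tie_case = tutor_knows_more((5, 5), (9, 2))
print(first_row, tie_case)
```

A tool built on the pairing method could run this comparison on every pair of a configuration and flag those in which the designated tutor does not know strictly more items.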

Table 6 A class pairing configuration formed by the reciprocal pairing method

Insights

For both the reciprocal pairing method and the score method, our results indicate that, at least for the classroom we evaluated, every student always has something to offer some other student. Beyond presenting a method for forming such groups, showing students that they are a relative “expert” in some area of the curriculum (however small it may be) could be motivating and result in improving students’ self-confidence in mathematics. On the other hand, grouping students randomly may result in pairs where students may have less to offer in a peer-tutoring exchange, which could be demotivating. Even if a teacher wants to group students randomly, we present a way to use ALEKS data to automatically determine content that each pair could focus on, rather than having students identify their relative strengths, as is the case in existing work on RPT.

Discussion

In this paper, we introduced three grouping algorithms that are designed to help teachers leverage an intelligent tutoring system (i.e., ALEKS) to more effectively implement common pedagogical practices, such as small group differentiated instruction and Reciprocal Peer Tutoring (RPT). To do this, we leveraged the ALEKS artificial intelligence, which adopts a fine-grained multidimensional view of ability through the lens of Knowledge Space Theory (KST). This approach sets our methods apart from prior work on group formation, which typically uses a single or small set of assessment scores to gauge ability for the purpose of grouping students (Christodoulopoulos & Papanikolaou, 2007; Connor et al., 2007; Manske et al., 2015). We showed examples that suggest our devised methods are capable of highlighting student strengths and weaknesses across hundreds of skills to form more nuanced groupings. In addition, we also addressed an implementation challenge faced by teachers using the ALEKS software, which is that teachers often notice a wide disparity between where students are in the system and the curriculum taught in class (Phillips et al., 2020). The methods address this not only by forming groups of students, but also by identifying deliberate pockets of items that each group can focus on. We imagine this could potentially help teachers save time in forming groups as well as in identifying the content that is most appropriate for differentiated instruction and RPT sessions. As mentioned earlier, our methods can be used to support a variety of instructional practices, including teacher-led differentiated small group instruction, collaborative problem solving, RPT, or even just making sure groups of students are working on similar content in the tutoring system so that the instructor and peers can more easily provide support.

While our algorithms are designed to work specifically with ALEKS, it is worth noting that they can easily be adapted to any system that partitions student ability into many distinct skills and provides a global assessment of student performance on these skills. Specifically, closest fits are systems that possess a KST-like model that captures prerequisite and postrequisite relationships among many discrete skills of an academic course. For example, the ASSISTments platform has an initial assessment, PLACEments, which uses a prerequisite map to determine which skills each student needs to practice (Heffernan & Heffernan, 2014). Alternatively, our methods could also be modified to accommodate approaches that employ assessment techniques to represent student knowledge at a fine-grained level without specifying prerequisite relationships. For example, our methods could be utilized with Curriculum-Based Measurement (Deno, 1985; Stecker et al., 2005), which is already used in math and reading and has been paired with heuristics that form peer tutoring pairs (Fuchs et al., 1994). We note, however, that it may be more difficult to use such algorithms with mastery learning-based tutoring systems that have a fairly fixed curriculum sequence and require students to attain mastery on one set of skills before moving on to another (e.g., cognitive tutors; Anderson et al., 1995). For such platforms, students have typically not been assessed on skills that they have not yet reached, and students are assumed to have already mastered skills that have been completed, making it difficult to group students together based on their relative mastery of different content.

Adopting a fine-grained view of student ability also resulted in insights about the nature of student ability in the classroom. While ability grouping and tracking are common practices in classroom settings, our findings suggest that adopting a multidimensional view of ability could lead to opportunities to have students work together, even if they may traditionally be considered as having different ability levels. These findings may suggest the need for adjusting current notions of ability and controversial practices such as tracking (Cipriano-Walter, 2015; Finley, 1984; Gallardo, 1994; Kelly, 2004; Oakes, 2005), as such practices may negatively influence students’ self-concepts in mathematics (Chiu et al., 2008). Moreover, the reciprocal pairing method showed that every student in the class possessed some piece of knowledge that another classmate did not have and was ready to learn, which should not be understated. From this, we obtained evidence that supports an encouraging perspective, which is that each student has the potential to be a meaningful contributor to the learning process of their peers and that no student is without unique strengths. In practice, an RPT session may help to instill these encouraging views in the learner, which may lead to new student perceptions of their own ability, and thus lead to increased levels of achievement.

While the results and insights of this study are highly encouraging, we recognize there are several limitations. First, the metrics used to evaluate our methods were developed for the current study, mainly because the literature does not address group formation based on a fine-grained view of ability like the current study. Therefore, the lack of widely accepted metrics for the given context makes it difficult to validate our choice of metrics. Nonetheless, we emphasize that our metrics were chosen based on what we anticipate teachers would care about in practice, rather than to simply favor our methods. However, we recognize that they may not be the only useful metrics to consider in practice, and working with teachers to improve upon and implement these methods may surface more important metrics. Another limitation is that we only performed our analysis on one class for each grouping method, and therefore we cannot be certain that our findings generalize across different classes and contexts. All classes were Algebra 1 classes, which tend to be taken in the 9th or 10th grade in high school. It remains to be seen what the quality of the groups would be for classes at other grade levels (e.g., elementary school and middle school). Additionally, future work could investigate to what extent class-level characteristics such as initial ability, learning rate, and class range in overall ability affect the measures of the chosen metrics. We note, though, that we did not “cherry pick” classes in order to improve the performance of our methods; rather, we chose these classes based on the amount of available data and other considerations (see Appendix A for more details).
In addition, although we were careful to take a fine-grained view of ability, we did not consider other important student attributes that are highly influential in collaborative learning dynamics, such as gender, friendship, learning perspectives, personality traits, engagement, and motivation (Donovan et al., 2018; Kanika et al., 2022; Liang et al., 2021; Sanz-Martínez et al., 2019). In the case of the reciprocal pairing method, it may be feasible for teachers to perform multiple runs (with the assistance of a tool) to allow for selecting a configuration that meets certain requirements unrelated to ability. In future work, we could consider modifying the algorithms to handle additional teacher-specified constraints. For example, the reciprocal pairing method could be performed on a reduced set of content so as to prevent RPT exchanges involving widely different areas of the curriculum.

Although beyond the scope of our study, perhaps the most notable limitation is that we do not actually evaluate how these groupings would be used by teachers and how they might impact student learning. An interesting question to consider in future work is whether students tend to be repeatedly grouped with the same peers in a real-life situation. We saw an example of this with the curriculum-wide grouping method where students were repeatedly grouped together to work on a geometry subslice; however, this is not necessarily expected in a real-life situation, since presumably once students are grouped to work on the same subslice, as long as they practice that subslice on ALEKS, they will subsequently be grouped to work on another subslice, possibly with a different group of peers. Moreover, the effects of implementing these grouping methods may get lost in the noise when students are learning the material. Therefore, it is imperative that future evaluation studies of these methods take a mixed methods approach where not only student outcomes are examined, but also where teacher perceptions, attitudes, and potential implementation difficulties can be investigated, including fidelity in adhering to the grouping and content recommendations produced by these algorithms.

Ultimately, we recognize the need to study the usability and effectiveness of these group formations in a real classroom setting. We are currently embarking on design-based implementation research studies to work with educators in implementing grouping methods in classroom settings. After initial meetings with potential partners who already implement ALEKS in a blended learning model, we are encouraged to know that they are independently interested in small group implementation strategies using ALEKS data. More broadly, we hope that our methods show how we might form a bridge between personalized educational technologies and teachers’ pedagogical practices. If proved to be useful, such a bridge could (a) increase and improve how the educational technologies are used in classroom settings, and perhaps more importantly, (b) improve teachers’ pedagogical practices outside of the technology itself.