Introduction

In recent years, data collection in educational settings has been increasing. While this is partially attributed to institutional audit culture and its practices of benchmarking and formalizing accountability (Shore and Wright 2003; Shore 2008), the rise of technology use in classrooms (Tondeur et al. 2017; Long et al. 2017), coupled with advances in artificial intelligence and machine learning algorithms that heavily rely on large quantities of data, has accelerated this trend. One purpose for the collection and use of this data is to create predictive models of learners, the targets of which range from academic performance to affect and engagement in class (Gardner and Brooks 2018). Applications of these models include early warning systems (Macfadyen and Dawson 2010), which are used to alert advisors, instructors, administrators, or students themselves if a student appears to be struggling, so that the student can be supported before falling significantly behind (Alhadad et al. 2015).

However, these systems often rely upon the collection of sensitive data such as demographics, grades, and interaction traces with online content (Pardo and Siemens 2014), data that students may be uncomfortable sharing for learning analytics depending on the stakeholder involved (Ifenthaler and Schumacher 2016). For instance, third parties, such as Learning Management System (LMS) vendors, have also turned to developing early warning systems and products that rely on educational data, even though such data sharing arrangements may be unclear to students (Polonetsky and Jerome 2014). The manner in which data collection is conducted thereby creates a tension between institutional goals of using predictive models to support students’ educational progress and retention, instructor goals of course-specific performance monitoring, and commitments to learners’ consent, agency, and privacy (Pardo and Siemens 2014; Prinsloo and Slade 2014b).

There have been numerous calls to provide students with more agency regarding how their data is used in learning analytics (Pardo and Siemens 2014; Drachsler and Greller 2016). Yet, students’ privacy concerns may deter them from consenting to the use of their data in learning analytics. Moreover, biases have been shown to exist in predictive models, partly due to non-representative samples acquired during data collection (Ocumpaugh et al. 2014). When the availability of data is restricted, machine-learned models may lose accuracy, which can lead to less effective interventions for some (or all) students (Li et al. 2019). This is particularly concerning since demographic gaps already exist in educational achievement (Bainbridge and Lasley 2002); this is especially true for underrepresented minorities (Bensimon 2005), those with a lower socioeconomic status (Duncan and Magnuson 2005), and between genders in certain contexts such as STEM programs (Matz et al. 2017). Not only are there outcome discrepancies, but it has also been shown that different demographics-based communities have different expectations of privacy and different concerns about how their data is to be used (Cho et al. 2009). If students in minority groups or from particular backgrounds are more reluctant to share data, their data will be absent, which may bias models in ways that are not representative of all students.

In this study, we investigate students’ propensity to consent to or opt out of having their data collected and used for learning analytics. We further connect consent propensity to students’ demographics, personality characteristics, privacy perceptions, as well as students’ perspectives and concerns regarding learning analytics in order to understand the factors motivating students’ expressed consent preferences. Linking participants’ responses to demographic characteristics enables us to analyze the differences between student subpopulations and how those might translate into differential consent rates. The research questions we address are as follows:

  1. RQ1: What are students’ perspectives on their educational data being used in learning analytics?

  2. RQ2: What are the population and participation characteristics of students who indicate a preference to allow or disallow their educational data to be used for learning analytics?

In “Methods”, we describe our study to answer these questions by first ascertaining students’ propensity to consent to or deny use of their educational data for learning analytics with an email-based, one-question preference elicitation prompt. Respondents were subsequently invited to complete an online survey that investigated the factors behind their consent indication in order to identify key determinants. The email prompt and online survey responses were then associated with students’ institutional demographic data in order to contextualize the relationship between students’ demographic characteristics and their propensity to participate in learning analytics. We sent our email prompt to a sample of 4,000 students at our institution stratified by ethnicity and gender; 272 students responded to the email prompt, of whom 116 further completed the survey.

In “Findings”, we report differences in response rate to the email prompt among genders and ethnicities. Female students were much more likely to respond than male students and, despite stratified recruitment, responses from White students were overrepresented while responses from Black students were underrepresented; there were no differences in consent behavior between genders or ethnicities. Among respondents, we identified three important factors that play a role in students’ consent expressions regarding learning analytics: a student’s trust in the educational institution, their level of concern regarding individual data collection, and their comfort with an instructor’s use of data for improving student engagement. Certain privacy attitudes are correlated with population subgroups: most notably, students identifying as Black generally express less trust in the institution, and female students tend to have greater apprehension about personal data collection while simultaneously being comfortable with instructor use of such data to improve student engagement.

Our findings suggest that instructors may have an important role in making students feel at ease when it comes to data sharing. We discuss in “Discussion” how this comfort may be bolstered by being more transparent regarding how data is used and who has access to it, thereby balancing broader institutional interests of effectively educating students with individual privacy safeguards and student agency. We also discuss limitations of the current study and routes to deepen our understanding of the rationale behind students’ consent decisions.

Background and Related Work

We discuss prior work on privacy and ethical concerns regarding learning analytics, equity and disparities in education, and sociocultural orientations in education.

Privacy and Ethical Issues in Learning Analytics

Learning analytics relies on the collection and use of student data that may include sensitive information and confidential records, which raises privacy concerns (Drachsler and Greller 2016; Ifenthaler and Schumacher 2016; Reidenberg and Schaub 2018). Meanwhile, broader changes in society emphasizing individuals’ rights in data processing are reflected in new privacy regulations such as Europe’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) and California Privacy Rights Act (CPRA). In this light, discussions around the implications of collecting, using, and analyzing student data in educational contexts are becoming more critical (Prinsloo and Slade 2017; Niall 2017). Existing research has pointed out emerging privacy and ethical issues around learning analytics, including student consent and agency over student data, and their trust in learning analytics systems (Pardo and Siemens 2014; Drachsler and Greller 2016; Rubel and Jones 2016).

Student consent is critical not only to demonstrate respect for students and their decisions, but also to support important values such as autonomy and freedom of choice (Sedenberg and Hoffmann 2016). Seeking student consent also acknowledges students’ rights and their voluntary collaboration in allowing the collection and use of their data by learning analytics to support student learning (Slade and Prinsloo 2013). It is an ethical approach for institutions to adopt codes of conduct that guide informed consent, data collection purposes, and transparency of data use in order to minimize potential harm and allegations of misuse (Land and Bayne 2005; Slade and Prinsloo 2013).

Prior studies have found several sociodemographic characteristics contributing to disparities among demographic groups when it comes to their consent to participation in research such as age (Jacobsen et al. 2004; Benfante et al. 1989), gender (Ramos et al. 2004; Pirzada et al. 2004), socioeconomic status (Boshuizen et al. 2006; Gordon et al. 1959), and ethnicity (Moorman et al. 1999, 2004). Li et al. (2019) found that student consent or opt-out decisions can affect the predictive power of learning analytics models for different student subpopulations. In our study, we quantified students’ participation and consent rates for learning analytics by demographic groups, which is important for contextualizing the differential effects identified by Li et al. We further investigate the underlying reasons as to why students choose to consent or opt out of learning analytics, and how these factors are linked to demographic characteristics and personality traits.

Meanwhile, consent is closely related to autonomy and agency (Alexander 1996). Student agency is characterized as students being able to hold themselves accountable to make decisions in learning processes, which is critical to students’ learning engagement and pursuit of learning goals (Deakin Crick and Goldspink 2014; Seifert 2004). To increase student agency and empowerment to participate in learning analytics, students should be viewed as collaborators in learning analytics rather than data producers or service receivers (Buchanan 2011; Kruse and Pongsajapan 2012), and Sun et al. (2019) found that students demand more agency in decisions regarding how data about them is used.

On the other hand, current consent practices face challenges and critiques, as consent is often treated as an operational act rather than being understood and assented to with moral legitimacy (Barocas and Nissenbaum 2014). Barocas and Nissenbaum (2009) identified that, in the online behavioral advertising context, consent often neither sufficiently captures users’ agreement to tracking and targeting nor conveys meaningful notice that could facilitate users’ choices, due to the disconnect between the privacy policies of different parties (e.g., data publishers, contracted third parties), the changing nature of privacy policies, and the lack of data flow transparency to users. As a result, compounded by asymmetrical power relationships with companies, people can feel powerless toward seemingly inevitable privacy violations, a social phenomenon Draper and Turow (2019) describe as digital resignation. Obfuscatory consent practices may confuse people and discourage them from demanding agency (Draper and Turow 2019; Ellison and Ellison 2009).

Furthermore, students’ trust in learning analytics systems plays a critical role in supporting an educational ecosystem that maximizes the experiences of different stakeholders such as learners and educators (Drachsler and Greller 2016), creates reciprocal relationships between the institution and the students that encourage students to share their data for learning benefits (Slade et al. 2019), and facilitates the establishment of reliable analytics systems (Petersen 2012). Prior work has also found several factors positively influencing students’ trust in learning analytics, such as protecting data from unauthorized access or distribution, proper storage of historic data, data de-identification, valuing student privacy, achieving consensus on data collection purposes, and transparency of data collection (Pardo and Siemens 2014; Clarke and Nelson 2013; Drachsler and Greller 2016; Slade and Prinsloo 2013; Beattie et al. 2014). Recent work shows that students inherently trust and expect their institution to properly and ethically use student data (Slade et al. 2019). Our study explores whether students’ trust in the institution affects their propensity to consent to learning analytics.

Equity and Disparities in Education

Equity has long been a fundamental concept in education. Simon et al. (2007) describe equity in education as twofold: fairness and inclusion. Equity as fairness illustrates that an individual’s socioeconomic status should not affect their chances to pursue education. Equity as inclusion acknowledges the basic need to complete compulsory education in order to acquire the skills needed in society.

Racial equality remains a controversial issue in education due to disparities in academic outcomes and limited access to opportunities and resources for students of color (Noguera 2016). Students from Hispanic/Latinx, African American, American Indian, and Pacific Islander groups are underrepresented at all levels of higher education from undergraduate majors to graduate program pursuits, particularly in STEM-related fields (Hanson 2008; Cook and Córdova 2007).

Even with successful graduation from higher education, individuals from minority groups are less likely to consider pursuing research careers (DePass and Chubin 2008).

In the late 1990s, Ladson-Billings (1998) stated that “the intersection of race and property [is] a central construct in understanding a critical race theoretical approach to education”. While the fundamental belief in critical race theory (CRT) is to “recognize the experiential knowledge of people of color” (Matsuda 2018), education scholars hold the same belief and have recognized further aspects of CRT that align with education equity goals (Ladson-Billings 1998; Dixson and Rousseau 2005), such as CRT’s grounding in historical and contextual analysis, its challenges to mainstream notions of neutrality, objectivity, color-blindness, and merit, and its valuing of the opinions of people of color (Crenshaw et al. 1995; Matsuda 2018).

Gender equity in education has also been a subject of national debate as shown when the American Association of University Women (AAUW) published The AAUW Report: How Schools Shortchange Girls (Bailey et al. 1992), which marks a series of efforts to support gender equity through the introduction of topics such as race and gender on campus and girls in science and technology (Corbett et al. 2008). Gender disparities have been shown to play a role in students’ academic performances, school experiences, education outcome, and barriers while achieving their educational goals (Buchmann et al. 2008; McWhirter 1997; Grossman and Grossman 1994).

As learning analytics aims to support teaching and learning for all students (Diaz and Brown 2012), discussions arise around the fair use of learning analytics (Prinsloo and Slade 2014a; Roberts et al. 2017) and predictive models (Dwork et al. 2018; Friedler et al. 2019; Liu et al. 2018; Gardner et al. 2019) due to the potential biases and lack of impartiality in such algorithms (Cofone 2018; Richardson et al. 2019); this could lead to inaccurate modeling for populations that are not well represented (Li et al. 2019; Ocumpaugh et al. 2014). When coupled with the fact that minorities are already less likely to consent in numerous contexts as we described in “Privacy and Ethical Issues in Learning Analytics”, it becomes crucial to understand how characteristics such as gender and race affect students’ consent propensity in the context of learning analytics to avoid developing models that inadvertently widen disparities.

Sociocultural Orientations in Education

Learning science research has established that students’ academic performance is related to factors such as their personality traits (Zhou 2015), cultural background (Niles 1995), and competitiveness and cooperativeness (Baumann and Harvey 2018). An individual’s competitiveness and cooperativeness is part of their social interdependence orientation (Johnson et al. 1998; Johnson and Norem-Hebeisen 1979), and such characteristics are associated with one’s gender, cognitive and social development (e.g., perception and response in group settings), attitudes toward the educational institution and relevant people in that environment (e.g., other students and teachers), and perspective-taking ability (Madsen 1967; Johnson and Engelhard 1992). More specifically, positive social interdependence (cooperation) is established when individuals in a group share common goals and their collective actions affect the group’s outcomes (Johnson and Johnson 1991; Deutsch 1949). In other words, people cooperate when they realize that they would not accomplish the goal without everyone working towards it (Johnson et al. 1998; Johnson and Johnson 1991). Relatedly, people’s gender (Ramos et al. 2004; Pirzada et al. 2004) and perceived contributions of their consent to research benefits (Kim et al. 2017) (as an example of perspective-taking) have been shown to affect consent. We therefore explore whether students’ competitiveness/cooperativeness, as a representation of their various underlying cognitive and social developments, is a factor influencing their willingness to consent, as well as their perspectives on data collection and use.

Furthermore, the cultural aspect of social orientations can reflect an individual’s decision-making considerations and motivation to succeed (Johnson and Engelhard 1992; Triandis 2018). Among the different dimensions of cultural measurement, individualism-collectivism (IND-COL) has been the most studied (Hofstede 1984; Cozma 2011). IND-COL can be a key characteristic of an individual’s racial identity (Nobles 2006), influences how people prioritize personal goals versus group goals (Schwartz 1990; Yamaguchi 1994), and can be used as a framework to analyze whether one feels connected to and responsible for the group they belong to (e.g., students’ perception of their roles and responsibilities as students) (Taylor and Moghaddam 1994; Triandis et al. 1988). Carson (2009) also identified that collectivism is reflected in students’ beliefs about the purpose of education and the way they evaluate academic success. As discussed in “Privacy and Ethical Issues in Learning Analytics”, consent is closely related to one’s agency, and student agency relies on students being considered collaborators in learning analytics. Thus, we investigate whether there is a relationship between students’ sense of responsibility to contribute their data (as a form of collectivism) and their consent practices.

Methods

Our study investigated two primary research questions: (RQ1) What are students’ perspectives on their educational data being used by learning analytics systems in the form of predictive models? and (RQ2) What are the population characteristics of students who indicate they would consent to or opt out of participating in such uses? In order to investigate these questions, we distributed an email-based preference elicitation prompt to students asking them whether or not they would hypothetically agree to have their data used in learning analytics systems. Upon selecting either yes or no to indicate their consent preference, students were redirected to an online survey that asked about the rationale behind their consent indication and their perspectives regarding their data being used for learning analytics in different contexts and by different stakeholders. We further elicited relevant personality characteristics and attitudes that might impact students’ propensity to consent. Responses were then linked with institutional demographic data to identify correlations with consent. This study design is summarized in Fig. 1.

Fig. 1
figure 1

Our study consisted of an email prompt sent to students that included links to the online survey, which consisted of multiple components. Email and survey responses were linked to institutional student records. The analysis methods are also shown with the corresponding data needed for each approach in order to address our research questions

The study team comes from an interdisciplinary background and has a variety of experiences with student data. Dr. Brooks, for instance, has been a part of the institutional stewardship chain for student data related to learning technologies, which is adjacent to the data we collected. In addition, Dr. Schaub has been involved in institutional processes related to privacy and learning analytics, and the whole study team has been involved in student modeling and educational data science research at the institution including qualitative and quantitative approaches in the past. Next, we explain each part of the study design in greater detail. Our study has been approved by our Institutional Review Board.

Measuring Privacy Perceptions, Personal Traits, and Decision to Consent

We sent a one-question email prompt to a stratified student sample to understand students’ consent decisions regarding their data being used by learning analytics. Li et al. (2019) found that the use of either “opt-out” or “opt-in” wording leads to different response rates from participants. Thus, we prepared two variants of the email prompt, shown in Fig. 2.

Fig. 2
figure 2

The email for the opt out condition is on the left and the email for the opt in condition is on the right. They are identical except for the last paragraph and consent options. Each student was sent only one of these two versions

Students only saw one framing and we conducted pilot testing to ensure that the wording did not lead to confusion. Once a student clicked either the yes or no link, the response was logged with an identifier to link it with their corresponding institutional demographic records. Identifiers were subsequently discarded before analysis. Regardless of response, respondents were then directed to a debrief that explained the purpose of the study, an informed consent form, and an invitation to participate in an optional online survey. Participants who completed the online survey were compensated $5.

The email prompt allowed us to ascertain propensities for students’ consent to learning analytics data use. Our online survey further explored why such decisions were made. Note that we intentionally used a broad consent message in order to study the factors and pre-conceived notions about learning analytics that influence students’ consent decision. We are not advocating for this prompt as an exemplar for broadly soliciting data consent decisions on live systems.

For the survey questions (see Appendix A for the full survey instrument), we iteratively refined the wording to minimize misinterpretation and pilot-tested the questions with a group of about 10 undergraduate and graduate students working on privacy-related and educational technology research. While the survey contains multiple scales, most questions were Likert scale items that did not require significant cognitive load to process. We also provided fair compensation based on the average completion time of 15 min, which we do not consider to be excessive, though it is plausible that some participants exited due to length. As shown in Table 1, of the 272 people who clicked one of the options in the email prompt, 150 consented and started the survey, of whom 116 completed it, i.e., a survey completion rate of 43%.

Table 1 Response rates per condition

At the beginning of the survey, participants were asked in three open-response questions to describe the important factors that affected their consent decision, perceived benefits of student data being used by learning analytics systems, and concerns with such data use. Next, we asked participants to rate their level of comfort on a seven-point scale with their educational data being used in five scenarios by different stakeholders for different purposes (e.g., “help instructors gain insights about students’ engagement”).

We further assessed students’ level of competitiveness and cooperativeness in the educational setting using the Social Interdependence Scale (Johnson and Norem-Hebeisen 1979) as such characteristics are associated with one’s attitudes toward the educational institution, the relevant people in that environment (e.g., other students, teachers), and perspective-taking ability (Madsen 1967; Johnson and Engelhard 1992). We aimed to explore if students’ competitiveness/cooperativeness as a representation of their various underlying attitudes would be a factor influencing their willingness to consent.

Given that students’ trust in the institution has been shown to be a fundamental factor influencing students’ learning experience (Van Maele et al. 2014), we wanted to understand whether students’ institutional trust might impact their consent propensity. We used Ghosh et al.’s trust scale (2001) that defines trust as students’ confidence in the institution’s ability to support students achieving learning and career goals.

Students in our institution come from diverse cultural backgrounds. We hypothesized that students’ sense of responsibility to contribute their data (as a form of collectivism) could be a potential factor affecting their consent practice. Thus, we used the Horizontal and Vertical Individualism and Collectivism measurement scale (Triandis and Gelfand 1998) to evaluate horizontal individualism, vertical individualism, horizontal collectivism, and vertical collectivism. Since prior work has found that students have privacy concerns regarding learning analytics (Pardo and Siemens 2014; Picciano 2012), we also included the Internet Users’ Information Privacy Concerns (IUIPC) scale (Malhotra et al. 2004).

Finally, we asked demographic questions, including gender, ethnicity, first-generation college student status, and year of study, in order to understand key factors in the decision to consent with regard to demographic characteristics. While we had access to institutional demographic data for participants, which we also used in our analysis, this self-reported demographic information allowed students to self-identify gender, including non-binary gender options. We further asked participants to specify their country of origin. However, because not all respondents to the email prompt completed the survey, we used institutional records for ethnicity and gender in our statistical analysis. For year of study, we used students’ self-reported class standing.

Recruitment & Participants

As the data is collected from a single institution, we briefly describe the University of Michigan (UM) to help contextualize the work for others seeking to apply our results. Demographic statistics are obtained from the most recent figures (University of Michigan AA 2020) published by the Office of Diversity, Equity, and Inclusion (DEI). The student body is skewed towards higher socioeconomic status; the gender composition is balanced, with 50.6% identifying as men, 48.3% identifying as women, and 1.1% as transgender or gender non-conforming. UM is a large four-year, primarily residential, majority undergraduate, full-time, more selective university with lower transfer-in rates and very high research activity (Carnegie Classification IHE 2017). Out of approximately 46,000 students, the mean age is 22.7, with 7.9% coming from backgrounds where neither parent nor guardian has attended college. 75.0% of students were born in the US, and the ethnic composition is as follows: 4.3% African-American or Black, 24.2% Asian-American or Asian, 6.3% Hispanic or Latinx, 1.7% Middle Eastern or North African, 0.1% Native American or Alaskan Native, 57.9% White, 1.0% Other; 4.6% specified one or more of the previous categories.

The institution has a history of advancing DEI, and its stance is that DEI is key to individual flourishing, educational excellence, and the advancement of knowledge (University of Michigan AA n.d.). In 2017, the university established the Learning Analytics Guiding Principles (University of Michigan AA 2017), which define learning analytics and set respect, transparency, accountability, empowerment, and continuous consideration as UM’s core tenets for research in this field. The Center for Academic Innovation (University of Michigan AA 2020) also develops projects to extend academic excellence and provide sustainable solutions to advance learning, facilitate problem solving, foster equity and inclusivity, and increase access and affordability.

Students were recruited based on specific demographic characteristics in the institutional database containing students’ academic records and demographic details. For each email variant (opt-in versus opt-out wording condition), we recruited 2,000 students, with each student receiving only one version of the email (4,000 emails sent in total). Each sample of 2,000 students was selected using a disproportionate sampling method in order to ensure a balanced data set. The population was first divided into 5 strata based upon the ethnic categories listed in the institutional data (White/Caucasian, Asian, Black/African, Hispanic/Latinx, and Other, which included those who indicated two or more ethnicities, Hawaiians, and Native Americans). Each stratum was also balanced with respect to gender.Footnote 1 This meant that each ethnicity-gender group had n = 552 participants, with the exception of Black/African students (n = 342 for both males and females) due to scarcity.
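To make the sampling procedure concrete, the following is a minimal Python sketch of disproportionate stratified sampling with pandas. The `roster` frame, its `ethnicity` and `gender` columns, and the per-stratum quotas are illustrative assumptions, not the study’s actual data structures.

```python
import pandas as pd

def stratified_sample(roster: pd.DataFrame, per_stratum: dict,
                      seed: int = 0) -> pd.DataFrame:
    """Draw a fixed number of students from each ethnicity-gender stratum."""
    samples = []
    for (ethnicity, gender), group in roster.groupby(["ethnicity", "gender"]):
        # Scarce strata (e.g., Black/African students) receive a smaller quota.
        n = min(per_stratum.get(ethnicity, 0), len(group))
        samples.append(group.sample(n=n, random_state=seed))
    return pd.concat(samples, ignore_index=True)
```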

Quantitative Analysis: Identifying Factors in Consent Decisions

We used logistic regression to control for the various factors outlined and to identify which considerations are most important in students’ decision to consent. For each scale used, we computed a composite score from the items within each of its subscales. This procedure compressed the number of survey scale features to 24. Note that the five comfort rating questions were used as-is (a discrete value from 1 through 7, inclusive). Full details of this analysis method along with the models are found in the accompanying computer code at https://osf.io/sg4rk/.
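As a rough illustration of this step, the sketch below averages Likert items into one composite per subscale (one common way to form composites) and fits the logit model with statsmodels. The subscale-to-item mapping and the `survey`/`consented` variables are hypothetical stand-ins; the released code at the OSF link is authoritative.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical item names; the actual instrument is in Appendix A.
subscales = {
    "institutional_trust": ["trust_1", "trust_2", "trust_3"],
    "iuipc_collection": ["collect_1", "collect_2", "collect_3", "collect_4"],
    # ... remaining subscales, yielding 24 composite features in total
}

def composite_scores(survey: pd.DataFrame) -> pd.DataFrame:
    """Average the Likert items within each subscale into one composite."""
    return pd.DataFrame({name: survey[items].mean(axis=1)
                         for name, items in subscales.items()})

X = sm.add_constant(composite_scores(survey))  # survey: item-level responses
logit = sm.Logit(consented, X).fit()           # consented: 1 = consent, 0 = deny
print(logit.summary())
```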

Ensuring Data Quality and Correctness

To minimize errors due to low-quality answers, we checked survey responses for speeding and straightlining. Manual review of particularly fast and slow responses revealed no anomalies. Therefore, we chose to keep all responses.

We identified outliers and influential points by plotting Studentized residuals and Cook’s distance for each observation, using an absolute value > 2 and 4/(N − k − 1) as thresholds respectively, where N is the total number of observations and k is the number of explanatory variables. Studentized residuals are the residuals divided by estimates of their standard deviation, while Cook’s distance summarizes the effect of removing an observation on the fitted response values. This resulted in 15 flagged points. Manual inspection made it evident that 2 participants had selected the wrong option, either by accident or due to a misunderstanding of the prompt. For instance, one explicitly stated that, “I misread the choices. As it said yes I assumed it meant to opt-in, not ‘yes, I would opt out.’ ”; such answers were corrected. The remaining flagged items did not reveal any other evidently concerning issues. Removing all of these points results in quasi-complete separation and a large shift in the coefficients. Thus, it may be the case that those who denied use of their data were considered “unusual” solely because the overwhelming majority of students consented to data use for learning analytics; 15 of only 25 respondents who did not consent are in this list. We chose to retain these points in the model as they represent important perspectives to consider.
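A sketch of these influence diagnostics, assuming the design matrix `X` and outcome `y` from above, might use statsmodels’ GLM influence measures (fitting the logit as a binomial GLM to access them):

```python
import statsmodels.api as sm

# Fit the logit as a binomial GLM so influence diagnostics are available.
glm = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
influence = glm.get_influence()

N, k = X.shape
studentized = influence.resid_studentized   # flag |residual| > 2
cooks_d = influence.cooks_distance[0]       # flag D > 4 / (N - k - 1)

flagged = (abs(studentized) > 2) | (cooks_d > 4 / (N - k - 1))
print(X.index[flagged].tolist())            # observations for manual review
```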

Model Fit and Feature Selection

We fit a logit regression model using maximum likelihood estimation. The input variables include the 24 survey features, one for each subscale. The binary outcome variable was whether a student consented to or denied the use of their data. Because we are interested in understanding specific factors, a feature selection process was used to prune the list of inputs into a smaller subset. This helps ensure that the significance values used to make these determinations are reliable, that confidence intervals on regression coefficients are sufficiently narrow, and that violations of the linearity assumptions are addressed.

We used the variance inflation factor (VIF) as a gauge for multicollinearity and note that a number of features had a VIF above 5, indicating a problematic amount of collinearity. This is not necessarily surprising, given that it is plausible to expect that some of the measured concepts will be correlated with each other, particularly since we constructed composite scores based on subscales of an overarching latent trait. We alleviated this by conducting feature selection with recursive feature elimination (RFE) with 20-fold cross validation, which removes features iteratively based on feature importance, as well as backwards elimination (BE) with a threshold set at p < 0.05, which removes features in accordance with the highest p-values. We then chose the features common to both pruned models with p < 0.05.
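The following sketch shows one way to implement this pipeline with statsmodels and scikit-learn, again assuming `X` (the 24 composites) and `y` (consent) from earlier; exact estimator settings in the released code may differ.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# 1) VIF screen: values above 5 indicate problematic collinearity.
exog = sm.add_constant(X)
vif = {col: variance_inflation_factor(exog.values, i)
       for i, col in enumerate(exog.columns) if col != "const"}

# 2) Recursive feature elimination with 20-fold cross validation.
rfe = RFECV(LogisticRegression(max_iter=1000), cv=20).fit(X, y)
kept_by_rfe = set(X.columns[rfe.support_])

# 3) Backwards elimination: repeatedly drop the highest-p feature
#    until every remaining coefficient satisfies p < 0.05.
kept_by_be = list(X.columns)
while True:
    fit = sm.Logit(y, sm.add_constant(X[kept_by_be])).fit(disp=0)
    pvals = fit.pvalues.drop("const")
    if pvals.max() < 0.05:
        break
    kept_by_be.remove(pvals.idxmax())

# Retain the features common to both pruned models.
selected = [f for f in kept_by_be if f in kept_by_rfe]
```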

Quantitative Analysis: Understanding Relationships Between Key Consent Factors and Demographics

All categorical demographic variables (year of study, gender, and ethnicity) were one-hot encoded into N − 1 dichotomous variables, where N is the number of categories. We chose the category with the greatest population within each variable as the reference to exclude from the linear regression model. Therefore, an input was created for each demographic listed in Table 4 with the exception of “White”, “Sophomore (2nd Year)”, “Not First Generation”, and “Male”, which together formed our comparison group, resulting in a total of 8 one-hot encoded columns. The input variables are these 8 columns, while the target variable is the composite score for each of the Nf key factors identified using the feature selection process described in “Model Fit and Feature Selection”. This results in a total of Nf separate ordinary least squares models, one for each target variable.
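A minimal sketch of this encoding and the per-factor OLS models follows; the frame names `demo` (institutional demographics) and `scores` (the key-factor composites), as well as the column labels, are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

# Reference (excluded) category per demographic variable.
reference = {"ethnicity": "White", "year": "Sophomore (2nd Year)",
             "first_gen": "Not First Generation", "gender": "Male"}

dummies = pd.get_dummies(demo[["ethnicity", "year", "first_gen", "gender"]])
dummies = dummies.drop(columns=[f"{var}_{ref}" for var, ref in reference.items()])

# One ordinary least squares model per key factor.
for factor in ["institutional_trust", "iuipc_collection", "instructor_comfort"]:
    ols = sm.OLS(scores[factor], sm.add_constant(dummies.astype(float))).fit()
    print(factor, ols.summary())
```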

Diagnostics were conducted for each of these models. There were no indications of collinearity. Plotting the residuals against fitted values did not suggest any egregious outliers, nonlinear behavior, or major concerns regarding heteroscedasticity, which may deflate p-values due to increased variance that is unaccounted for in the model. Thus, we are reasonably confident in our coefficients and statistical conclusions, described in “Findings”.

Qualitative Analysis

To analyze open responses to survey questions, we engaged in successive rounds of open coding, in which one researcher went through all responses and developed an initial codebook (Saldaña 2015), followed by iterative codebook refinement by two of the authors independently coding a subset of responses and then jointly reconciling disagreement. After four iterations, high inter-rater reliability (Cohen’s κ = .77) was achieved. One researcher then used the final codebook to recode all responses. The final codebook consisted of 15 themes with 29 unique codes, see Appendix B.
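For reference, the reported inter-rater agreement corresponds to a computation along these lines, where `coder_a` and `coder_b` are hypothetical vectors of the codes each rater assigned to the same subset of responses:

```python
from sklearn.metrics import cohen_kappa_score

# Codes assigned independently by the two authors to the same responses.
kappa = cohen_kappa_score(coder_a, coder_b)  # the study converged at kappa = .77
```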

Findings

Beginning with the quantitative analysis, we present statistics about response rates, then show the key factors from the scale items underlying willingness to consent and how these correlate with demographics according to our regression models. The qualitative analysis is then presented based on students’ answers to the open-ended survey questions, laying out self-reported factors, benefits, and concerns regarding students’ views on data sharing, organized by subpopulations of students making similar statements.

Quantitative Analysis Findings

We break down our discussion of the quantitative analysis into three parts: statistics regarding participation rates during our initial email engagement with students, results regarding our logistic regression model used to identify primary factors underlying students’ decisions to share data (RQ1), and results from the linear regression models, which explain demographic correlations with each identified factor of importance (RQ2). We find both a gender gap and an ethnicity gap in response rates. The key factors identified behind the decisions of students who did respond were trust in the institution, level of general concern regarding individual data collection, and comfort with instructor use of data for classroom engagement. Institutional trust was generally higher for female students and lower for students who identify as Black, while data collection concerns and comfort with instructor data use were higher for females when compared to males.

Response to Email Prompt

Table 1 describes response rates split by email wording condition. Despite the low overall response rate of 6.8%, we find that, generally speaking, most people (72.4%) consent to data usage when they do respond. We do not find any effect on participation rates (link clicks) between the opt-in and opt-out conditions. While the consent rate is somewhat lower for the opt-out condition, a two-tailed test for proportions shows no statistically significant difference between the conditions (p = 0.39) when considering only those who made a selection. Therefore, for the remaining analysis, we combine the opt-in and opt-out conditions and look only at the aggregate data, given that the differences are negligible. This confirms the result of Li et al. (2019) that wording has no effect on participation rate, but contrasts with their finding of a difference in consent rate between conditions.
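The two-tailed test for proportions referenced above can be reproduced along these lines; the counts below are placeholders standing in for the actual values in Table 1.

```python
from statsmodels.stats.proportion import proportions_ztest

consent_counts = [100, 97]  # hypothetical consents: opt-in, opt-out
respondents = [136, 136]    # hypothetical respondents per condition
stat, p_value = proportions_ztest(consent_counts, respondents,
                                  alternative="two-sided")
```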

We also decompose click rates to analyze participation by subpopulation, such as ethnicity and gender (see Table 2). There is a significant difference in the number of clicks, or engagement, between male (106) and female (166) participants, despite gender-based stratification in recruitment. Among those who did respond, however, the consent rates do not deviate from expectation. A chi-squared test indicates that gender is independent of the consent rate, but this is not the case for the number of clicks (χ2 = 13.24, p = 0.0003, Cramer’s V = 0.22); there is a moderate association.
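A sketch of the chi-squared test and Cramér’s V for the gender comparison is below. The 106/166 click counts come from the text; the 2,000-per-gender denominators follow from the balanced recruitment and are an assumption of this sketch, so the resulting statistics may differ slightly from those reported.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[106, 2000 - 106],   # male: clicked, did not click
                  [166, 2000 - 166]])  # female: clicked, did not click
chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```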

Table 2 The number of link clicks and the number of people who consented to data use

A similar case holds for ethnicity: engagement differs quite drastically between subpopulations, especially when compared with the expected number of link clicks. While Asian and Hispanic respondents’ answers align with expectation (percent deviations of −4% and −1%, respectively), there is a notable overrepresentation of responses from those identifying as White (by +31%) and an underrepresentation of responses from those identifying as Black (by −40%). Once again, we find that ethnicity is not independent of the click rate (χ2 = 14.32, p = 0.002, Cramer’s V = 0.13), a medium effect size, whereas there is no such relationship with consent.

Given the aforementioned discrepancy, we ran a more specific test to see whether there are true differences in the proportion of those who click on an answer within these subpopulations. Namely, we divide the sample into those who identify as Black and non-Black (case 1), and those who identify as White and non-White (case 2). The sample statistic in case 1 is −0.03, with a 95% confidence interval (CI) of [−0.053, −0.012], corresponding to p = 0.002. For case 2, the sample statistic is 0.028, with a 95% CI of [0.011, 0.046] and p = 0.0015. Therefore, Black students participate less than those who are not Black, and White students participate more than non-White students.
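These intervals are consistent with a standard normal-approximation (Wald) interval for a difference in proportions, sketched below; the subgroup counts themselves are not reproduced here.

```python
import numpy as np

def diff_prop_ci(x1, n1, x2, n2, z=1.96):
    """Wald 95% CI for the difference in proportions p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)
```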

Identifying Primary Factors in Participation

We explore the reasons behind the differential engagement by subgroup and address RQ1 by identifying the key factors that led to students’ decision to consent or deny use of their data, for those who did respond. We fit a logit model where the input variables are the factors impacting students’ willingness to consent and the binary outcome variable is whether a student consented to or denied the use of their data. Since the goal is to identify critical factors, our focus is not on achieving the highest predictive accuracy, and we note that this logit model is not and should not be used to generate predictions of consent without further error analysis, as minority subgroup classifications may be unreliable and skewed towards the majority class distribution for imbalanced datasets. We further note that the feature selection process described below is based on coefficient significance values that are decoupled from predictions or measures of fit.

Our final model was obtained by conducting feature selection using two techniques: recursive feature elimination (RFE) with 20-fold cross validation and backwards elimination (BE) with a threshold set at p < 0.05, as described in “Model Fit and Feature Selection”. As BE yielded a subset of the features retained by RFE, we fit our model with the set from the more stringent standard and focus our discussion on the results of BE. Specifically, BE indicates a subset of three factors that have an effect on the response variable, and we consider the following to be impactful in students’ decision to consent: trust in the institution, concern about the amount of personal data collected, and comfort with instructor use of data for instructional purposes. Table 3 shows summary statistics for the final model.

Table 3 Summary of our logit model with three factors: institutional trust, the “Collection” subscale from IUIPC, and the self-developed stakeholder question regarding comfort with instructor use of data

The odds ratios for each of these key factors may be interpreted as a percent change in the odds of consenting to data usage given a one-point change in the corresponding subscale. Since all of the items in these three subscales are based on a 7-point Likert scale, a one-point increase in institutional trust means that a student’s odds of consenting increase by 132%. A one-point increase in comfort with instructor data use leads to a 212% jump, while a one-point increase in concern regarding data collection drops the odds of consenting by 78%.
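For readers translating between the regression output in Table 3 and the percentages above, the conversion from a logit coefficient β to an odds ratio and a percent change in odds is the standard one:

\[
\mathrm{OR} = e^{\beta}, \qquad \Delta\% = \left(e^{\beta} - 1\right) \times 100
\]

The reported +132%, +212%, and −78% changes thus correspond to odds ratios of approximately 2.32, 3.12, and 0.22 (coefficients of roughly 0.84, 1.14, and −1.51).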

Demographic Correlations with Key Factors in Consent

For each of the three key factors identified—institutional trust, data collection concern, and comfort with instructor data use—we then analyze correlations with demographic characteristics. First, we provide summary demographic statistics for those who completed the survey in Table 4. Note that, similar to the email engagement findings, demographic discrepancies are also present in the survey completion rates: students who identified as White or female are overrepresented, while students who identified as Black are underrepresented.

Table 4 Demographic counts for the number of students within each category for ethnicity, year of study, gender, and self-reported first-generation college student status

To understand the potential reasons behind these differences, we address RQ2 and identify correlations between demographics and influences impacting willingness to consent by running one linear regression model per key factor. The summary statistics for each model, where the outcome variables are the corresponding subscale scores, are tabulated in Table 5. For institutional trust, we find that identifying as female corresponds to higher levels of trust relative to males. The opposite is true for certain ethnicities: Black students are less trusting of the institution when compared to White students. Female students also correlate with having more data collection concerns, as measured by the IUIPC collection subscale, while having greater comfort with instructor use of data for course purposes. Finally, the collection score has an inverse relationship with those identifying as Asian (that is, students who identify as Asian may not be as concerned with personal data being collected), although this is a weaker claim given the larger p-value (p = 0.15).

Table 5 Summary statistics for each of three linear regression models with different target variables: institutional trust, IUIPC Collection subscale, and the stakeholder question for comfort regarding instructor data use

Qualitative Analysis Findings

Our analysis of students’ responses to the open-ended survey questions revealed more nuanced student perspectives on data collection and the factors influencing their willingness to consent. First, we found that students recognized that allowing their data to be used in learning analytics systems can contribute to improving education, support new research, and positively impact other students, while also expressing concerns about data privacy, data collection, and ambiguity around data usage. Second, based on the patterns identified through our statistical analysis, we report corresponding qualitative findings that provide insight into students’ rationales behind those patterns. Namely, students who commented on trusting the institution all consented, while those who said they distrust the institution or the researchers denied consent. Students had varying views on data collection depending on the context of use and the stakeholders involved. We further probed students’ privacy perceptions and found diverse connections. Finally, students’ responses revealed that they considered instructors to be key stakeholders and users of student data.

Reported Important Factors in Consent Decision

Among the 116 responses regarding important factors that affected students’ consent decision, 92 came from students who consented and 24 from students who denied consent. For the 92 students who consented, we identifiedFootnote 2 19 factors that affected their decision to consent (see Table 6). 30% of students who consented valued that their data would contribute to the improvement of education, learning, and teaching. A fifth of the students stated their support for allowing student data to be used for advancing understanding and research insights on student learning behaviors, teaching methods, etc., and 20% expressed willingness to contribute their student data if it could help other students or future generations learn. Some students (17.4%) pointed out the importance of supporting research, indicating research was a factor that led them to consent. Around 16% mentioned some level of privacy consideration when deciding to consent, such as valuing privacy in general, assuming that student data privacy is guaranteed by default, or indicating privacy concerns while still consenting. Two further factors, each valued by 14.1% of students, were that student data improves learning analytics systems and that allowing student data to be used is a purposeful and meaningful act. For instance, one student said that “thinking about the greater good influenced my decision. If my student data can help improve quality of education overall, I would support its use.” Roughly 14% had a neutral response to the use of student data, for example: “I have nothing to lose when giving my data,” while 12% did not identify any concerns. 12% of students believed that contributing to student data use is important to ensure data completeness and accuracy.

Table 6 Decision factors reported by students who consented

Of the 24 students who denied consent (see Table 7), 63% expressed privacy concerns (e.g., data breach, uncomfortable sharing student data), and 20% expressed concern over the lack of transparency regarding how data is collected and used, and by whom. 16.7% mentioned the lack of proper compensation for the use of their student data. 12.5% were concerned about potential negative impacts on themselves, such as “I would be worried about my academic data being used in a way that negatively affects me.” A few students denied consent due to their distrust in the institution or the researchers. One student noted that use of student data could harm marginalized groups, and one student noted a lack of agency and control regarding student data use.

Table 7 Distribution of factors from students denied consent

Perceived Benefits of Data Use in Learning Analytics

Students also identified benefits of using student data for learning analytics regardless of their consent practice (see Table 8). The top three benefits mentioned by all students were the value of contributing data to improve education (52%), supporting research (38%), and positive impact on other and future students (35%). 21.7% mentioned wanting to ensure data completeness and accurate analysis. 14% thought they might be positively affected, for instance: “I think it’s important that my data be used to better improve and optimize our learning environments, which will benefit not only me, but the students that will be coming after me.” Eight students (7%) supported using student data to improve learning analytics systems. Students also recognized that student data use could positively impact the university (8.7%), faculty and staff (7%), and “others” without specifying who (5.2%).

Table 8 Distributions of benefits in using student data

Perceived Concerns Regarding Data Use in Learning Analytics

In terms of concerns (see Table 9), over 60% mentioned privacy and security concerns, such as “I’m a tad concerned that my data could be leaked to the general public. I like some privacy, so I don’t really want everyone to have unfettered access to all my student data.” 21% worried about inaccurate data interpretation. Data leakage or theft concerned 15 students (13%). 11% worried about a lack of transparency. For instance, “I am concerned about my privacy, who has access to my student data, and the real purposes for which it is being used (i.e. more than just for optimizing learning for the future).” While 11% wrote “no concerns,” another 10% pointed out negative impacts for them, such as “I feel it would violate my privacy or that it might be used against me in some way.” Some worried about data confidentiality (6.8%) or insufficient student agency and control (5.1%), and one student did not trust the institution to use the data responsibly.

Table 9 Distributions of concerns in using student data

Students’ Trust & Distrust in Institutions

Our quantitative analysis revealed institutional trust to be a significant predictor of students’ consent. Among students who consented, four (3 females: 1 Asian, 1 White, 1 prefer not to disclose; 1 male: Black) explicitly expressed trust in the institution, mentioning its reputation, accountability, and research methods (e.g., to use data properly). Another student trusted that researchers would be “handling my data appropriately and not abuse it.” In contrast, two students who denied consent (1 male, 1 non-binary; both White) expressed distrust in the institution. One of them did not “trust the university to use this data in a way that won’t hurt marginalized students.” The second did not trust that “data wouldn’t be used for commercial purposes.” Additionally, two Black students who denied consent (1 male, 1 female) distrusted researchers because it was unclear how the researchers might use student data, and because of the related privacy risks and potential harms.

Our quantitative analysis further showed that Black students generally tended to trust the institution less. Of the nine Black students who completed the online survey (6 females, 3 males), 7 consented and 2 denied (1 male, 1 female). One student who denied consent expressed distrust: “I’m not sure what they’re using the data for and I don’t trust it.” In contrast, most of the 7 consenting students focused on benefits of student data use such as “potentially help another student in the future,” “be more informative and beneficial on teaching/learning methods and tools than self-report,” and “the university and other students can improve from analyzing my student data.”

Student Perspectives on Data Collection

As our quantitative analysis further showed that students’ propensity to consent to learning analytics data use is negatively correlated with their concerns about data collection, we analyzed all open-ended responses that explicitly mentioned data collection. Nine female students and one male student, from a range of ethnic backgrounds, commented on data collection (comments did not differ based on gender or ethnicity). Two female students who did not consent (1 Asian, 1 White) cited discomfort with sharing personal data as the reason. One of them also expressed privacy concerns: “I am not sure which information about me is being collected and analyzed and how will that information be applied to optimize learning...who would get access to my information and to what extent.” The other 8 students, who all consented, expressed mixed attitudes towards data collection. The majority of them supported data collection for better understanding students’ performance, more accurate and representative results, research, and improving student learning. This suggests that students who emphasize benefits of data collection are more inclined to consent.

We further looked at students’ responses mentioning privacy to shed further light on their attitudes toward data collection. 65 students (41 consented, 24 denied) mentioned privacy concerns in 92 individual responses, 72 of which expressed data collection concerns. Notably, there were no distinctive differences between answers from students who consented and those who did not. 33 of these students (20 consented: 16 females, 4 males; 13 denied: 8 females, 1 non-binary, 4 males), with diverse ethnic backgrounds (consented: 2 Black, 5 Hispanic, 6 Asian, 6 White, 1 American Indian or Alaska Native and Black; denied: 2 Black, 1 Hispanic, 3 Asian, 6 White, 1 Middle Eastern or North African), stated concerns and uncertainty about potential data misuse and the lack of transparency regarding data collection, data access, and data sharing. These students further expressed concerns about possible abuse or misuse of student data, such as by researchers, “the system”, for marketing, or to sell information. For instance, one student stated: “what would happen after my data is used for its primary purpose? Does it just sit in a database, available to anyone for other use without my knowing or consent? Does it get deleted?”

Privacy concerns of 10 female and 5 male students (10 consented, 5 denied) focused on data security, leaks, and improper exposure. The 5 students who denied consent (2 White males, 1 Hispanic/Latinx female, 1 Asian male, 1 Middle Eastern or North African female) mainly worried about student data being compromised if the system is not secure enough; others noted how general privacy and security issues factored into their decision not to consent: “It feels like my data isn’t safe with anyone...how can I trust any group when every day there are news stories about major platforms/companies failing their users, intentionally or not?” The 10 consenting students also expressed concerns about potential data leakage, exposure, or hacking, and how extracted data could be “used against me” or “affect my future job opportunities.” Relatedly, 11 students (9 males and 3 females; 9 consented, 2 denied) noted risks of being identified. The students who denied (2 White males) stated that “the data won’t stay anonymous” or “more people would know my information,” which is similar to what those who consented said. However, we observed that students who consented tended to comment with a more trusting tone, assuming that collected student data would be aggregated and anonymized. This suggests that whether data is de-identified and handled properly affects students’ consent propensity.

We noted earlier that all students who distrusted the institution denied consent. Trust also came up in relation to privacy concerns. Six students (3 males: 2 White, 1 Black; 3 females: 2 Hispanic/Latinx, 1 White), of whom 2 consented and 4 denied, expressed distrust or discomfort regarding data collection, use, and access. The four denying students, 3 White and 1 Black, were uncomfortable with their information being gathered and known by different parties: “I would prefer to not have my information (i.e., classes, my learning tools) being gathered. I feel a little uncomfortable” and “I wouldn’t trust other people looking at my personal information.” Thus, students’ distrust and discomfort may also explain their reluctant attitude toward data collection, which is a driving factor influencing students’ consent propensity.

Views on Instructor Use of Student Data

Our quantitative analysis further revealed that students are more likely to consent to “help instructors gain insights about students’ engagement.” 9 students, of whom 7 consented (2 White females, 2 Black females, 1 Asian female, 1 Asian male, and 1 male who did not disclose ethnicity) and 2 denied (1 White male, 1 White female), noted benefits of instructors using student data. Some believed instructors can use student data to improve teaching methods and optimize the learning process, while others felt that it helps instructors better understand different types of students and provide more personalized support. It seems that students view instructors as key stakeholders and users of student data and are relatively more comfortable with such data use, which is positively related to their consent propensity.

Discussion

We identified three main factors that influence a student’s willingness to consent to learning analytics data use: degree of trust in the institution, concern regarding personal data collection, and comfort with instructors using data to gain insights on student engagement. We now discuss our findings’ contributions to the knowledge regarding students’ privacy perspectives and behaviors in an educational setting. First, we acknowledge limitations in extrapolating results due to our survey instrument. We then highlight how varying engagement rates may suggest that some students are not being well represented and what this implies for building AI systems and soliciting consent decisions. Next, we explore key factors and demographic trends in our findings, paying particular attention to the importance of instructors and discrepancies in trust between subpopulations. We end by discussing how institutional contexts factor into consent.

Survey Instrument Limitations

We acknowledge that an online survey is limited in its ability to identify the reasons underlying consent, especially for those who are already wary of sharing data online. However, we believe this data collection is a reasonably realistic (though by no means ideal) approximation of how asking for consent may be conducted at a university: namely, in an online fashion, likely via email or through a prompt in a particular learning platform. In our study, the consent message was not specific about the data collection purpose, so as to avoid priming participants. Ideally, purposes should be clearly specified and consent should be solicited for each specific purpose.

Our study’s ecological validity may be affected in that the views expressed by certain subgroups in this survey may not apply to all students in those subgroups. A similar argument may be made for the feature selection process: there may be other considerations that students find critical to their decision which are not reflected here. This is why we explicitly analyze relationships with ethnicity only after pruning the scope of potential consent-related factors. Still, these findings can help narrow the scope of future work to further pinpoint how specific interactions, such as instructor trust (discussed in The Role of Instructors in Key Factors for Consent), relate to engagement by subgroup. We also emphasize that while the survey used to identify the factors involved in students’ decision to consent was one component of our study (RQ1), a key aim for RQ2 was to empirically capture a consent process that could reasonably be undertaken, as described previously; not responding or merely clicking through is itself an important participation characteristic that the email prompt measures.

Nonetheless, since students answered their consent decision hypothetically, their responses might not align with their actual behavior and decisions in a real-world data sharing context. Mentioning a specific use case (e.g., an early-warning system used to assist student learning) or incorporating deception may influence students’ consent propensity or increase the response rate due to greater perceived relevance and urgency. That said, this must be weighed against ethical considerations and protocols, such as including a debrief immediately afterwards, and against the risk of diminishing some students’ trust in learning analytics research, even if the risks are deemed minimal.

The Difficulty Posed by Varying Engagement Rates

As shown in “Response to Email Prompt”, there is a difference in response rate by subpopulation. Namely, Black students responded to the email request at a significantly smaller percentage than expected, while responses from those identifying as White were overrepresented, even when we account for differences in the number of emails sent to each group. That such underrepresentation exists supports the well-established theories described in “Background and Related Work” and is not a surprise. However, we reiterate that institutions seeking input from students regarding data use may be receiving a biased sample – not only because there are minorities in the population, but also because those who are underrepresented are even less likely to respond to a survey.

With that said, it is possible that students chose to ignore the question due to a number of factors, such as the framing of the email, the perception that their decision does not matter, or other reasons that we cannot ascertain. However, the click rate still provides key additional context, since a non-participant is still someone on whose behalf the stakeholder would be making a decision: the data is either used or withheld. In other words, we make no claims that intention can be derived from non-responses or the click rate, but this metric can still be linked to demographics in order to bound the range of data that may be used if we were to treat those non-responses as decisions of consent or non-consent, for instance when training a machine-learned model.
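As a minimal sketch of this bounding idea (ours, for illustration only; the records and subgroup labels are hypothetical, not the study’s data), one could treat all non-responses first as denials and then as consents to obtain a per-subgroup range of usable data:

```python
# Illustrative sketch: bound the share of usable data per subgroup when
# non-responses are treated as all-deny (lower) vs. all-consent (upper).
from collections import Counter

# Hypothetical records: (subgroup, decision); None marks a non-response.
records = [
    ("White", "consent"), ("White", None), ("Black", None),
    ("Black", "deny"), ("Asian", "consent"), ("Asian", None),
]

totals = Counter(group for group, _ in records)
consented = Counter(group for group, decision in records if decision == "consent")
no_response = Counter(group for group, decision in records if decision is None)

for group in totals:
    lower = consented[group] / totals[group]
    upper = (consented[group] + no_response[group]) / totals[group]
    print(f"{group}: usable data in [{lower:.0%}, {upper:.0%}]")
```

The width of each interval then reflects how much a subgroup’s non-response rate, rather than its expressed preferences, drives uncertainty about data availability.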

Thus, to avoid potential erosion of trust and tension around data usage, it may be beneficial to further explore the addition of nudging indicators to move away from broad consent practices and lessen the gap in response rates in student elicitation surveys. For instance, a prompt might mention the specific data sources and stakeholders involved. This could help students understand the implications of their decisions, but may backfire if the details are overly complex and are thereby skipped or not comprehended. We could include a short paragraph explaining that it is invaluable for underrepresented students to provide their input to avoid biases in predictive models and to improve educational quality for all students; social framing has been shown to impact privacy decision-making (Coventry et al. 2016). This may be presented to a random sample of students, or shown only to minority students. Another option is varying levels of compensation based on the subpopulations most lacking in data (paying for data that is scarce), a more costly option that raises the broader issue of data ownership and value. It is important to note that the focus here is simply on getting people to indicate their consent preference either way, and not necessarily to nudge them to choose consent, though it is possible such changes could impact both response and consent rates (Utz et al. 2019). Consequently, depending on the intent of the actor, nudging may advance an institution’s goals while shifting the responsibility for bias onto students and countering their self-interests, so it is important to design prompts carefully.

The Role of Instructors in Key Factors for Consent

All three key factors we identified seem plausible and have intuitive explanations. It stands to reason that students who trust the institution and are more comfortable with instructors using data to improve classroom instruction would be more likely to share data, and that concerns regarding personal data collection would decrease the likelihood of doing so. However, of all the subscale measures and various stakeholders, instructor considerations are the most significant. This suggests that instructors may have significant influence on students’ willingness to share their data, perhaps even overshadowing broader concerns about general data collection or institutional practices. It is especially plausible that institutional trust and comfort with instructors are key consent factors that influence each other, as prior studies have shown that a sense of belonging to the university affects retention and engagement (Zepke and Leach 2010), among other factors, and that teacher-student relationships contribute to these feelings of rapport (Hagenauer and Volet 2014). Students likely have more opportunities to form close relationships with instructors and to interact with them more frequently, whereas “the institution” may be associated with administrators and other officials whose roles and direct impact are less easily ascertained.

Researchers are often required to provide consent opportunities to research subjects, but neither instructors nor institutions face such requirements or expectations. Learning analytics are increasingly embedded in the tools adopted by institutional actors such as instructors and advisors, and privacy-related decisions are made on behalf of students by institutions’ information officers and legal counsel. However, such norms may not reflect where institutions or society are heading. Trends in recent years, such as the GDPR and CCPA/CPRA, demonstrate rising interest in issues of privacy and consent. Questions regarding consent in the coming decade, such as the ones raised in this paper, are thus important to highlight. At UM, students regularly have to consent to data sharing when using certain external or third-party tools (e.g., Learning Tools Interoperability or LTI); it is not unfathomable that instructors may play a more direct role in consent practices in the near future.

Consequently, some suggestions for tangible interventions include having instructors provide more transparency regarding student data use in the classroom. For example, telling students what educational and demographic data is being collected, the purpose for its use, and who has access may shift students’ comfort with instructors’ use of data, thereby changing the likelihood of student consent. Similarly, having an instructor send an email prompt asking students to consent to or deny use of their data may elicit more agreement to share data than what may be seen as a unilateral institutional action. Future research on the relationship between specific instructors and student trust, and on whether there are correlations across degree programs and departments on campus, may yield context to better target such interventions.

Ethnicity and Gender Trust Gaps and The Role of Institutions

While it is more difficult to draw concrete conclusions connecting demographics with the key factors in willingness to consent reported in the online survey, due to lower sample sizes and overall power, some of the qualitative comments provide evidence supporting our quantitative results, such as the anticorrelation between institutional trust and identifying as Black. Yet, a fair number of Black students did indicate some level of trust in the institution in their open-ended responses. Discrepancies between genders displayed fairly strong signals for all three key factors. Specifically, we find that those who identify as female tend to trust the institution and instructors more, despite greater concerns regarding data collection as a whole. This may seem contradictory, though perhaps those who identify as female are hesitant in general but make an exception when data use is situated in an educational context, due to trust in the institution and/or the instructor. Therefore, conducting a confirmatory analysis to identify the deeper rationale behind these cost-benefit considerations would be beneficial.

Whether certain factors outweigh others may be identified through follow-up surveys and semi-structured interviews with specific groups of learners corresponding to the student characteristics found to correlate with key factors, such as Black students, female students, or students at the intersection of these identities. Questions about students’ personal experiences at their university, their relationships with instructors and other stakeholders, and their personal beliefs and attitudes around data collection specifically would provide insight into what agency students desire with respect to learning analytics. It may also uncover whether these decisions are based on firmly ingrained biases or actionable concerns, and help contribute to a more realistic model of student choices and their effects on predictive modeling in learning contexts.

Lastly, we want to differentiate between the restrictions of institutional ethics boards that oversee research study procedures and institutional data consent and collection practices. In the US, institutional review boards (IRBs) approve and monitor research involving human subjects, which includes this study. Title 45 of the Code of Federal Regulations (i.e., the Common Rule) contains provisions regarding how informed consent is obtained and additional protections for certain vulnerable populations. Even then, §46.104 and §46.116 list various exceptions where consent may be waived. Regardless, such regulations do not necessarily restrict what an institution can do; institutions wishing to engage students to gain consent for learning analytics tools or technologies do not need to do so under the banner of scholarly research.

There are many reasons why data may be collected or processed without the data subject’s explicit consent. The GDPR, for instance, recognizes six legal bases for data processing, of which consent is but one (Article 6). The regulation also provides derogations for processing for scientific research or statistical purposes (Article 89), thereby leaving situations that may be up to an institution’s discretion.

The question we explore with this study is what the effects would be if institutions did engage students for their consent to use their data. While good consent statements are transparent and meant to inform users of particular data-use practices, we made a deliberate decision to simplify the consent form, inspired by the form of today’s consent dialogues, in order to study the general act of consenting; this is similar to a click-through End-User License Agreement (EULA). There are strong arguments for requiring explicit consent or opt-in, especially when data processing is likely to be unexpected or surprising for the data subject (Rao et al. 2016; Schaub and Cranor 2020), such as a use of data that is not readily apparent from the transaction context (e.g., an LMS used to determine participation grades or which assignments to offer), but how such policies might influence trust and consent behaviors is unclear and left for future experiments.

Context Dependency and its Importance in Learning Analytics

Even with extensions to uncover detailed patterns of reasoning behind consent decisions, it is important to keep in mind that the dataset used in this study consists solely of records from a single major university in the United States with many initiatives to promote ethical innovation in learning analytics. Data collected at other institutions may have different underlying distributions and lead to distinct results with different conclusions; this is why we have provided contextual information in “Recruitment & Participants”. Conducting similar analyses at community colleges or in other educational settings may help generalize the results of this paper, especially as Li et al. (2019) demonstrated a need to reevaluate predictive models when their training sets are altered. A cross-cultural survey administered at institutions around the world would allow privacy expectations to be better understood.

With that said, it is not always possible to share demographic information across institutions due to data privacy concerns, so we might want to ensure the reliability of self-reported demographics. Among participants who did not omit their response, there was perfect agreement between self-reports and university records for ethnicity, and 98% agreement for gender; we expected the gender discrepancy since we offered a non-binary option in our survey, while gender in our institution’s records is still reported dichotomously. These high levels of agreement suggest that it may be possible to obtain accurate demographic information by only asking students to self-report demographics, without a significant loss of data. Not only does this give students more agency over what they choose to share, it also suggests that similar studies requiring demographic information could be conducted at other institutions, even if those details are not available to researchers or centrally stored.
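For concreteness, a minimal sketch of such a reliability check follows (our illustration, with hypothetical identifiers and values; it is not the study’s analysis code). It computes the agreement rate between self-reports and institutional records, skipping omitted self-reports:

```python
# Illustrative sketch: agreement rate between self-reported demographics
# and institutional records, ignoring omitted self-reports.
def agreement_rate(self_reports, records):
    """Both args map student id -> value; None marks an omitted self-report."""
    pairs = [(self_reports[sid], records[sid])
             for sid in self_reports
             if self_reports[sid] is not None]
    matches = sum(1 for reported, recorded in pairs if reported == recorded)
    return matches / len(pairs) if pairs else float("nan")

# Example: a non-binary self-report cannot match a dichotomous record,
# producing the kind of gender discrepancy described above.
gender_self = {"s1": "female", "s2": "non-binary", "s3": "male"}
gender_record = {"s1": "female", "s2": "female", "s3": "male"}
print(f"gender agreement: {agreement_rate(gender_self, gender_record):.0%}")  # 67%
```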

Eventually, inter-institutional datasets compiled from various types of educational institutions around the world may be joined to enrich our understanding of consent factors as they relate to diverse sociocultural backgrounds. On the algorithmic front, one may obtain more specific estimates of data sharing ranges, and demographics can be tied to students’ responses to calculate various opt-out ranges, since different subgroups have different privacy perspectives and rates of participation, as we have shown. The performance of predictive models may therefore incur greater differential effects, necessitating further research to balance trust and personal privacy with advancing education and ensuring fairness across diverse student populations. We may also gain a greater understanding of which aspects of institutional structure and policy are most effective at fostering a culture that encourages broad participation and minimizes paternalistic data collection policies. For instance, many institutions are hierarchical and rely on audit culture, so different measures of success, as well as flatter administrative structures with shared duties and codified values, may have an impact on student and staff behavior as well as on baseline participation rates.

Conclusion

In this study, we have addressed two critical questions regarding the use of students’ educational data in learning analytics. RQ1 asked about students’ perspectives on their educational data being used for learning analytics systems, and RQ2 sought to find the population and participation characteristics of students who indicated a preference to allow or deny such usage.

For RQ1, we identified three primary factors that influence a student’s willingness to consent to their data being used in learning analytics: trust in the institution, concern regarding personal data collection, and comfort with instructors using data to gain insights on student engagement. Higher levels of institutional trust and greater comfort with the idea of instructors using educational data for instructional improvement correlate with much greater probabilities that a student will allow their data to be used for learning analytics. By contrast, apprehension toward personal data collection leads to a lower chance of consent.

Given these factors, we then explored RQ2 and found that students who identified as female were more likely to trust the institution and instructor data use than male students, but were also more generally concerned about data collection practices. Meanwhile, Black students indicated lower levels of institutional trust. We also note that female students had a higher response rate and that White students were overrepresented while Black students were underrepresented among people who made a consent decision.

While there are limitations that come with survey instruments and the fact that this study was conducted at a single university, our findings surface important implications for institutions to consider when collecting data for learning analytics, and we lay out additional routes for confirming and generalizing the results presented here. We demonstrate that varying engagement rates reflect existing educational disparities affecting minority students and that instructors can influence students’ consent decisions.

The findings for RQ1 and RQ2 illustrate that it is insufficient to only provide students with consent prompts and expect unbiased data without a more concerted effort to involve instructors and institutional decision-makers. We support student agency and agree that collecting or using data without student consent violates that agency, though we also emphasize that consent is only one part of the puzzle in allowing for greater informational self-determination; it may take a combination of many other forms of agency, such as transparency, data access, or opportunities to object. Balancing student agency is therefore not a straightforward matter and requires careful consideration and design.

The differential response rates we identify show that perspectives from those who are underrepresented are still not properly accounted for, even when stratifying equally across groups. Moreover, the difference lies not only in consent rates but also in the unique underlying perspectives that guide the actions of each subpopulation. Relying solely on such an approach for data collection is therefore likely to continue producing biased predictions from biased data, undermining the efficacy of educational technology and its potential to treat students fairly. To ensure the ethical use of AI in education, the method of data collection is important, but it is imperative not to fixate solely on this aspect. Students’ major concerns should be addressed, and trust in their institutions’ many stakeholders strengthened, through tangible actions such as implementing transparent data practices; allowing inequity to continue will only increase mistrust, thereby lowering engagement and hampering institutions’ ability to support all their students. By understanding the key factors that influence consent and their relation to students’ personal backgrounds, institutions will be better equipped to enable technology-supported education while maintaining ethical data use and public trust.