Abstract
In the near future, systems that use Artificial Intelligence (AI) methods such as machine learning will be required to be certified or audited for fairness if used in ethically sensitive fields such as education. One example of these upcoming regulatory initiatives is the European Artificial Intelligence Act. Interconnected with fairness are the notions of system transparency (i.e., how understandable the system is) and system robustness (i.e., whether similar inputs lead to similar results). Ensuring fairness, transparency, and robustness requires looking at data, models, system processes, and the use of systems, as the ethical implications arise at the intersections between those. The potential societal consequences are domain specific; it is therefore necessary to discuss specifically for Learning Analytics (LA) what fairness, transparency, and robustness mean and how they can be certified. Approaches to certifying and auditing fairness in LA include assessing datasets, machine learning models, and the end-to-end LA process for fairness, transparency, and robustness. Based on Slade and Prinsloo's six principles for ethical LA, relevant audit approaches will be deduced. Auditing AI applications in LA is a complex process that requires technical capabilities and needs to consider the perspectives of all stakeholders. This paper proposes a comprehensive framework for auditing AI applications in LA systems from the perspective of learners' autonomy, provides insights into different auditing methodologies, and emphasizes the importance of reflection and dialogue among providers, buyers, and users of these systems to ensure their ethical and responsible use.
1 Introduction
The digitalization of learning processes, with the provision of online learning objects such as interactive quizzes, videos, and texts, creates massive amounts of data on learners, learning processes, and learning results. AI technologies, especially machine learning, are increasingly used to provide data-based, individualized, meaningful insights and adaptive learning paths or recommendations. However, the ethical implications of AI, including issues of fairness, non-discrimination, transparency, and robustness, are increasingly being discussed as part of the broader concept of trustworthy AI (see Sect. 3).
The quality of an AI system will need to be assessed in terms of its intended task, functionality and performance, robustness, and transparency [26, 70, 82]. All of those aspects are linked to fairness. If an AI system does not accomplish its intended task (functionality and performance), or if it is designed to discriminate, it cannot be considered fair. If the AI system cannot cope with incomplete or noisy input data (robustness), or delivers completely different outcomes after small input changes, it will fail to perform accurately in uncommon or unexpected scenarios, or in scenarios with extreme values, often referred to as edge cases. If the system output is opaque to the user, learners and instructors might draw the wrong conclusions.
This paper contributes to the research discussion in several ways:
1. Providing a comprehensive framework for auditing AI applications in LA systems by proposing a four-phase audit process, which includes delimitation, the risk-based definition of audit criteria, auditing and assessment, and monitoring and re-assurance. This framework can be useful for practitioners and researchers who are interested in evaluating the ethical implications of AI systems in education.
2. Focusing on the learner's perspective by emphasizing the importance of putting the learner at the center of the risk analysis, ensuring that any LA system benefits learners and does not put them at risk. This perspective is crucial, as ethical AI systems in education should prioritize the well-being and autonomy of the learners.
3. Discussing auditing methodologies by clustering different methodologies for auditing AI systems into four categories: reviewing system objectives, interventions, and consequences; reviewing datasets; analyzing source code and model quality; and conducting technical black-box testing. This discussion can be useful for researchers and practitioners who are interested in selecting appropriate auditing methods for their AI systems.
4. Emphasizing the need for reflection and engaging in joint discussions among providers, buyers, and users of these systems. This emphasis on reflection and dialogue can contribute to the development of more ethical AI systems in education.
Overall, the paper provides a comprehensive framework for auditing AI applications in LA systems from the perspective of the learner's well-being and autonomy. It also offers practical insights into different auditing methodologies and emphasizes the importance of reflection and dialogue. These contributions can improve the research discussion on ethical AI systems in education and guide practitioners and researchers in designing and evaluating these systems.
The ethical implications of AI in education are increasingly being discussed in the context of trustworthy AI. In heavily regulated domains such as finance and healthcare, AI systems are subject to rigorous auditing processes. While a draft regulation for AI is being discussed in the European Union [31], no federal regulation is currently being prepared in the USA. Nonetheless, recent initiatives document the importance of the ethical use of AI on the other side of the Atlantic: in October 2022, the White House issued a nonbinding Blueprint for an AI Bill of Rights [102]. The National Institute of Standards and Technology (NIST) issued an AI Risk Management Framework in January 2023 [75]. In China, regulation on recommender systems was already enacted in 2022 [23]. Hence, the comprehensive framework for auditing AI applications in LA systems provided in this article will be useful for practitioners and researchers interested in evaluating the ethical implications of AI systems in education in all regions, while the most urgent need is in the European Union. This includes educational institutions, educational technology providers, and policymakers involved in regulating AI in education. By following the proposed audit process and considering the domain-specific audit criteria and methodologies presented in this article, stakeholders can ensure that AI systems in education prioritize the autonomy of learners and comply with regulations.
In this article, the discussion around the fairness, transparency, and robustness of AI systems will be briefly summarized (Sect. 3), and AI auditing requirements, processes, and methodologies will be introduced (Sect. 4). Domain-specific audit criteria for ethical AI applications in LA will be presented in Sect. 5. A process for auditing AI applications in LA will be introduced in Sect. 6 and numerous, complementary auditing methodologies will be discussed.
2 Learning analytics
Over the past decade, LA research has steadily increased [89]. LA uses data from learners and learning platforms to analyze learning processes and contributes to enhancing learning [59], for example by implementing learning dashboards, predicting performance or drop-out and potentially triggering personalized interventions, or by providing personalized recommendations, feedback, or hints [4, 28, 60, 66, 74]. AI technologies such as machine learning or natural language processing are frequently used in LA [2, 3, 10, 17, 88, 91]. Fairness, transparency, and explainability are inevitable dimensions in LA, especially as learners' professional futures are at stake [35, 77, 86, 97].
3 Fairness, transparency, and robustness of AI systems
The use of AI technologies is rapidly expanding in all areas of society. However, AI technologies are associated with a number of risks, especially related to privacy, fairness, robustness, and transparency [39]. The privacy-related risks of using data-driven technologies have been widely discussed, also with regard to LA [27, 77]. Trustworthiness, including fairness, robustness, and transparency, poses additional challenges when AI technologies are used for learning analytics. In the following, we will briefly summarize the discussion around the fairness, transparency, and robustness of AI systems.
3.1 Fairness of AI systems
The most widespread AI systems used today are based on machine learning technologies, where relationships are abstracted into models from data (“learned”). However, if the data used to train those models contains correlations that are considered unfair or biased, the AI system may produce unfair outcomes when used in sensitive environments such as the criminal justice system or for credit decisions. AI systems replicate biases from the training data or can even amplify them, especially for underrepresented groups [108], as has been shown also for LA applications [85, 90]. Further sources of bias in AI systems result from the data selection and measurement process, data preparation, or the interaction of the user with the system [67].
Unfair AI systems can lead to direct discrimination [34], indirect discrimination [11, 34, 108], or underestimation [52]. Biased educational systems can result in allocative harm, when they prevent groups from accessing education, or in representational harm, when they underrecognize or stereotype certain groups [9]. Fairness in LA applications of AI is increasingly discussed [25, 55, 86]. This highlights the need for auditing methodologies for AI-based LA systems.
3.2 Transparency of AI systems
Transparency with regard to the handling of student data has been demanded of LA systems for a long time [77]. Again, the use of massive amounts of data, including log data, and of machine learning technology poses new challenges to the transparency of LA systems, as AI systems are not easily comprehensible. The transparency of AI systems can be assessed in three categories: traceability, explainability, and communication [30].
Traceability refers to the documentation of the AI system, including the methods used for gathering, labeling, and pre-processing data, the split into training, test, and validation sets, the choice of ML model, model parameters and evaluation criteria, and the hardware and software setup [30, 72].
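To make this concrete, traceability documentation can be captured in a structured, machine-readable form. The following minimal sketch uses an illustrative Python dictionary; the field names and values are invented for illustration, not a prescribed schema.

```python
# Minimal sketch of machine-readable traceability documentation for an AI
# component; the fields mirror the items named above, all values are invented.
trace_record = {
    "data": {
        "gathering": "LMS event logs, export 2023-01",
        "labeling": "instructor-assigned pass/fail",
        "preprocessing": ["drop sessions < 30s", "z-score normalization"],
        "split": {"train": 0.7, "validation": 0.15, "test": 0.15},
    },
    "model": {
        "type": "gradient boosted trees",
        "hyperparameters": {"n_estimators": 200, "max_depth": 4},
        "evaluation": ["AUC", "F1", "demographic parity gap"],
    },
    "platform": {"hardware": "single CPU node", "software": "Python 3.11"},
}
```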
Explainable AI (XAI) is a dedicated field of research. Explainability, in the sense of an understanding of the system, is required by AI developers to improve the system, by managers to ensure proper use of the system in organizations, by AI users to validate the match between input and output, by individuals affected by AI results, and by auditors to assess compliance with requirements [68]. The explanations provided need to be adjusted to the needs of those groups [68]. Broadly used approaches to improving explainability are the use of more transparent models (decision trees vs. neural networks), the assessment of feature importance [6, 63, 87], or the inspection of the model through counterfactual examples [24, 62, 65]. However, technical model explanations were found to be of limited use to system users, as explanations need to consider the process within which the system is used, the recipient's knowledge, and domain-specific approaches to explanations [62].
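As a minimal illustration of counterfactual inspection, the following sketch changes a single feature of one learner record and observes the prediction shift; the model, data, and feature names are invented and do not stem from any specific LA system.

```python
# Minimal sketch of counterfactual probing: change one input feature of a
# single learner record and observe how the model's prediction shifts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy training data: [forum_posts, quiz_score, days_active]
X = rng.normal(loc=[10, 0.6, 30], scale=[5, 0.2, 10], size=(500, 3))
y = (X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 0.1, 500) > 0.9).astype(int)
model = LogisticRegression().fit(X, y)

learner = np.array([[4.0, 0.55, 12.0]])     # record under inspection
counterfactual = learner.copy()
counterfactual[0, 2] = 25.0                 # what if the learner had been active longer?

p_orig = model.predict_proba(learner)[0, 1]
p_cf = model.predict_proba(counterfactual)[0, 1]
print(f"P(pass) original: {p_orig:.2f}, counterfactual: {p_cf:.2f}")
```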
Communication refers to the need to inform recipients of the system’s output adequately about the functioning of the system, as well as its limitations [30].
3.3 Robustness of AI systems
If decisions are taken based on the results or recommendations provided by a system, users expect those results not to change massively with small fluctuations in the input data. Robustness [110] refers to an AI system's ability to handle incomplete or noisy input data and still provide reliable results. In educational applications, where human assessments are often used to measure competencies, progress, effort, or output quality, robustness is particularly important since the accuracy of the input data cannot always be guaranteed [29, 48].
Some machine learning systems, such as neural networks, have been found to lack robustness [100]. Lack of robustness is often demonstrated using designed edge cases as so-called adversarial examples [38]. A trade-off has been found to exist between accuracy and robustness [103].
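A simple robustness probe can make this concrete. The sketch below, using an illustrative model and synthetic data, measures how often predicted labels flip under small Gaussian input noise; a high flip rate hints at a lack of robustness.

```python
# Minimal sketch of a robustness probe: add small Gaussian noise to the
# inputs and measure how often the predicted label flips. Model and data
# are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

base_pred = model.predict(X)
flip_rates = []
for _ in range(20):                        # repeat with fresh noise draws
    noisy_pred = model.predict(X + rng.normal(0, 0.05, X.shape))
    flip_rates.append(np.mean(noisy_pred != base_pred))
print(f"mean label flip rate under small noise: {np.mean(flip_rates):.3f}")
```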
4 Auditing AI systems
Auditing AI systems is complex, because AI systems are part of socio-technical systems that are under continuous development [56]. Fairness, transparency, and robustness of AI systems cannot be assured using a single practice or technology, but they can be created through conscious human-centered design choices about datasets, models, and validation processes [93]. Standardized certifications by independent auditors are considered to be important factors to achieve human-centered AI systems [95].
This section is divided into three parts: first, the need for AI audits will be explained, secondly, processes for auditing AI systems will be presented, and lastly AI auditing approaches and methodologies will be discussed.
4.1 Auditing requirements
There is an ongoing global discussion about the need to regulate and audit AI systems. The European Commission has proposed the Artificial Intelligence Act [31], which, once enacted, will regulate all AI systems used in the European Union and require conformity assessments for all high-risk AI systems, including those used in education. In 2022, the Chinese administration issued an administrative rule on algorithmic recommender systems, which aims to protect national security and "social public interests" and for which providers have to document adequate audits to ensure compliance [18]. The German Federal State of Schleswig-Holstein has recently adopted a law on the requirements for AI systems used in the public sector, which includes education [49].
Finance and healthcare are heavily regulated and technology-driven sectors, as they are associated with substantial risks. As a result, regulation in these sectors is more advanced than in other industries. In the finance sector, conformity assessments have been required for some time for algorithmic trading systems in Europe: according to Art. 6 [EU 31/589], they are required when the system is used for the first time, upon material changes, or prior to material updates. The regulation also requires financial institutions to separate production and testing environments for conformity assessment. For risk models in financial institutions, explainability is a challenge when the relationship between input and output data can no longer be described verbally or mathematically, as historically required by supervisory authorities [7]. Explainability is, therefore, considered a major criterion for model selection [8]. As changes to important risk models have to be reported to, and sometimes approved by, the regulating authority, there is a need to define what constitutes a model change, for example re-training based on updated data or the adjustment of hyper-parameters such as the number of layers in a neural network [7].
Auditing requirements for medical devices that use AI differ depending on the type of AI used: static medical AI systems that use a fixed, trained model are easier to certify than dynamic medical AI systems that regularly update their machine learning models based on new data [46, 106]. The inherently higher uncertainty of continuously learning medical devices needs to be justified by better functionality [106].
For other domains, such as educational technology, this means that the level of risk associated with the AI systems in use should dictate the level of explainability required. Additionally, it is important to differentiate between static models and dynamic models that are continuously updated. Significant model updates will require re-auditing.
4.2 Auditing process
The process for auditing AI systems depends on the chosen audit methodology. In the following, we will describe processes for fairness audits [1, 83], external code audits [101], and life-cycle-based audits [36, 79]. However, as the harmonized standards referred to in the AI Act do not yet exist, none of the presented approaches can claim compliance with the future European regulation. The final harmonized standards for AI audits are not expected to be released until 2025. The European Commission's final standardization request was approved in February 2023, and the development of standards for "Safe and trustworthy artificial intelligence systems" is now included in the 2023 annual Union work program for European standardization [32].
Agarwal et al. [1] propose a standard operating procedure for fairness certification which relies on a training dataset to calculate a bias index for the protected attributes and a fairness score for the total system. The bias index takes several fairness metrics into account, and the fairness score averages the bias indices [1]. Raji et al. [83] created a process framework for internal audits of AI systems consisting of the stages (1) scoping, (2) mapping, (3) artifact collection, (4) testing, (5) reflection, and (6) post-audit. During the scoping stage, the audit scope is defined and system documentation as well as underlying principles are acquired. Auditors assess the system's social impact and ethics within use cases [83]. In the mapping stage, interviews are conducted with stakeholders to perform a failure mode and effects analysis (FMEA) that is used throughout the subsequent stages. In the artifact collection stage, information on systems and models is obtained. The testing stage includes the main auditing procedures and the review of the obtained material with regard to the identified risks using appropriate methodologies. In the last two stages, the audit results are documented and reported, and possible mitigating measures are assessed [83]. An internal AI audit is also described in [71]. The authors report contradictory requirements within the organization and difficulties in operationalizing the scope and claims of the audit [71].
Tagharobi and Simbeck [101] introduced a process for code-based fairness audits of LA systems that consists of the steps (a) definition of the scope of the audit, (b) artifact collection and confinement, (c) mapping, description, and prioritization of relevant functions, (d) fairness assessment, and (e) interpretation of results. They find that the assessment of a system based on code alone is only possible to a limited extent if neither training data nor test data are available [101].
Other auditing approaches acknowledge the difficulties in assessing production systems and propose following the AI development lifecycle to ensure compliance [36, 79]. CapAI [36] introduces a procedure for conducting internal conformity assessments that is supposed to meet EU AIA requirements and that follows the AI lifecycle stages (design, development, evaluation, operation, retirement). For every stage, review items with corresponding documentation are defined. The procedure describes data pre-processing and the splitting of data into training, validation, and test data, and proposes to use standard measures for model quality (mean squared error, mean absolute error, accuracy, precision, recall, F1-score) and fairness (such as equal opportunity, disparate impact, equalized odds, demographic parity) [36]. Given the procedure's broad target audience, it serves more as a best practice manual for AI development. It stands out from those best practices, however, by calling for the definition, documentation, and communication of the norms and values on which the design of the AI system shall be based. It does not explain, though, how the required ethical assessment against these values shall be conducted. The Fraunhofer AI assessment catalog [79] follows the AI development lifecycle as well but defines the stages of data, development, and production. Each of the dimensions of fairness, autonomy/control, transparency, reliability, security, and data protection shall be assessed based on risk analysis, adequate metrics, and measures [79]. Concerning fairness, the procedure calls for defining possibly disadvantaged groups and adequate fairness objectives in the framework of the risk analysis. Metrics are to be defined with regard to system output and data input. Measures include data pre-processing (such as preferential sampling or data reweighing [51]), post-processing of results (calibration, thresholding, transformation), or the choice of adequate models (such as re-weighing or adversarial learning).
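As a minimal illustration of two of the fairness measures named above, the following sketch computes the demographic parity gap and the disparate impact ratio from model decisions and a binary protected attribute; all values are invented.

```python
# Minimal sketch of demographic parity and disparate impact, computed from
# model decisions and a binary protected attribute; arrays are illustrative.
import numpy as np

y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])   # model decisions
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # protected attribute

sel = {g: y_pred[group == g].mean() for g in (0, 1)}  # per-group selection rates
demographic_parity_gap = abs(sel[0] - sel[1])
disparate_impact = min(sel.values()) / max(sel.values())
print(f"DP gap = {demographic_parity_gap:.2f}, DI ratio = {disparate_impact:.2f}")
```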
In summary, there is agreement that the elements of scoping (delimiting the system to be audited) and the risk-based approach are required elements of an AI auditing process [1, 36, 79, 83, 101]. Many audit approaches recommend documenting the system development process [36, 71, 79, 83] which is in line with the AIA conformity assessment based on internal controls.
4.3 Auditing methodologies
In this section, several auditing methodologies will be discussed. This includes methodologies described in academic audits [21, 92], commissioned or voluntary audits [76, 109], user audits [53, 94], and non-binding proposals for standards from institutions [16, 26, 44, 46].
Ideally, complete system documentation, source code, and data are available to conduct an AI audit. According to DIN SPEC 92001, the quality of AI systems needs to be assessed on the levels of their model, data, platform, and environment [26]. Here, the "setting in which the AI module is situated and with which it can interact" is defined as its environment. The platform refers to the hardware and operating systems on which the AI is run, including limitations induced by interfaces, processing power, or technical dependencies [26]. However, such a comprehensive approach is not always possible. Even if the source code is disclosed, the sheer size and complexity of the system might make it difficult to assess its fairness, transparency, and robustness [92]. If the source code and machine learning model are not available, a black-box audit can be conducted. In black-box audits, system input data is systematically varied to identify the impact of input features on system output [73]. Sandvig [92] differentiated between four black-box AI auditing approaches: noninvasive user audits (questioning users about interactions with a system), scraping audits (systematically querying a system), sock puppet audits (using test user profiles), and crowdsourced audits (collaborating with system users to systematically collect real-world system input and output).
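A sock puppet audit can be sketched as paired queries that differ only in a protected attribute. In the following illustration, query_system is a hypothetical stand-in for the audited system's interface, since a black-box auditor sees only inputs and outputs.

```python
# Minimal sketch of a sock-puppet-style black-box audit: paired test
# profiles that differ only in one protected attribute are sent to the
# system and the outputs are compared. `query_system` is hypothetical.
import random

def query_system(profile: dict) -> float:
    """Hypothetical system under audit; returns e.g. a risk score."""
    random.seed(str(sorted(profile.items())))   # deterministic toy stand-in
    return random.random()

base_profiles = [{"quiz_avg": q, "logins": l}
                 for q in (0.4, 0.6, 0.8) for l in (5, 20, 50)]
deltas = []
for p in base_profiles:
    score_a = query_system({**p, "gender": "f"})
    score_b = query_system({**p, "gender": "m"})
    deltas.append(score_a - score_b)
print("mean score difference between paired profiles:",
      sum(deltas) / len(deltas))
```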
Black-box audits have been used to assess the fairness of online delivery of job ads [47] and housing ads [5]. In both cases, discrimination was detected by running experiments based on assumed optimization algorithms used by platforms. In other cases, system output data are verified against publicly available third-party data to assess representativity and identify sampling bias [21]. Black box audits are sometimes initiated by accidental findings of system users and later crowdsourced to obtain larger datasets [94]. Examples include the investigation of the Twitter image cropping algorithm [94], or the TikTok recommendation algorithm [53].
Datasets that are unrepresentative or biased can lead to incorrect or biased machine learning models; [86] provides an example of how bias is transferred from the data into the model. The audit approach of an "extended conversation" was used in the 2020 audit of the hiring assessment provider HireVue [76]. The audit comprised a review of documentation, interviews and discussions with stakeholders, the establishment of an ethical matrix, and the planning of remediation steps [76]. The potential fairness issues identified through the stakeholder interviews (i.e. balanced training data, accents in voice data, dealing with short answers) were explained and categorized based on information provided by HireVue [76]. The audit thus serves more as an example of the first step of an audit process, the identification of domain-specific risk areas. It is to be noted that the stakeholders included an association representing minority candidates or neuro-atypical candidates as well as a client.
A similar approach is proposed by the German Institute of Public Auditors, which sets standards for public accountants in Germany, in its draft auditing standard for AI systems [44]. The audit will either assess appropriateness (documented measures meet minimum criteria, are appropriate, and are implemented) or, beyond that, effectiveness (documented measures meet minimum criteria, are appropriate, are implemented, and are effective) [44]. According to [44], auditors rely mainly on the description and documentation provided by the auditee. Important auditing procedures include the identification and interrogation of responsible persons and the assessment of the system documentation with regard to completeness, correctness, and timeliness [44]. The draft standard describes four areas of requirements for the AI system: compliance with ethical and legal requirements, comprehensibility (transparency and explainability), IT security (confidentiality, integrity, availability, authorization, authentication, binding force), and performance [44]. In order to meet the requirements, the draft standard calls for the implementation of AI governance measures, AI compliance, AI monitoring, data management, model training, AI application, and AI infrastructure [44].
In their cooperative audit of the pymetrics hiring tool, the authors of [109] assess risks identified in an initial scoping step: correctness, direct discrimination, de-biasing circumvention, sociotechnical safeguards, and sound assumptions (imputation of missing values). Their audit uses mixed methods: it is primarily a code audit (manual examination of source code) that was complemented by systematic experiments to explore the possibility of de-biasing circumvention, a review of documentation, and a data review [109]. They reported that the source code (Jupyter notebooks provided by pymetrics) correctly implements the adverse impact metric and that it did not use demographic features that could lead to direct discrimination [109]. They assessed the sociotechnical safeguards (human oversight) by understanding the data science process in the audited company [109]. To test the impact of the imputation of missing values, they analyzed the distribution of missing values among groups and tested the model for adverse impact [109].
For some application domains of AI systems, questionnaires have been proposed for auditing the systems. IG-NB [46] created a questionnaire for the certification of AI in medical products that covers the areas of the supplier's AI competency, documentation, medical purpose and context of use, risk management process, functionality and performance, user interface, security risks, data acquisition/labeling/pre-processing, model creation and assessment, and post-market surveillance. While some requirements in the questionnaire refer specifically to medical applications (e.g., the product requirement definition should include indications, contraindications, and comorbidities), most of the items are generic and could be applied to assessing any AI system (e.g., the requirement to document feature selection and the split into training/validation/test data).
The German Federal Office for Information Security published an AI Cloud Service Compliance Criteria Catalogue (AIC4) that also relies on the AI service provider's system description [16]. The provided compliance criteria are divided into eight areas: general cloud computing compliance, security and robustness, performance and functionality, reliability, data quality, data management, explainability, and bias. In order to assess security and robustness, the catalog calls for the documentation of continuous risk management procedures against malicious attacks, including scenario building, risk exposure assessment, robustness testing using for example white-box, black-box, or physical attacks, and the implementation of countermeasures [16]. With regard to performance assessment, specific criteria are recommended (e.g., ROC curve, AUC, and Gini coefficient for scoring), and AI system providers are also required to describe the capabilities and boundaries of the applied machine learning model [16]. The data quality criteria include, among many others, the "selection and aggregation of [data that] is statistically representative and free of unwanted bias" [16]. Specifically, equalized odds, equal opportunity, demographic parity, and fairness through (un)awareness are listed as fairness metrics, and several methods for bias mitigation are also mentioned [16]. The level of explainability, on the other hand, shall depend on the purpose, potential damages, and decision-making implications [16].
In summary, the different auditing methodologies are partly complementary, as they provide different insights into the system. On the other hand, not every methodology can be applied in every audit, depending on the availability of system access, extensive documentation, or source code (see Table 1).
5 Audit criteria for ethical LA
In their seminal work, Slade and Prinsloo [97] describe six principles for ethical LA which will serve as a base in the following section to deduce domain-specific audit criteria for AI applications in Learning Analytics. Figure 1 gives an overview of the six principles and the derived audit criteria. The principles proposed by Slade and Prinsloo [97] are widely recognized in the field of LA and are often used as a framework for the ethical use of data in education as they prioritize the purpose of education [43].
The first principle demands LA to be a "moral practice", which means that it should not follow a technocratic, efficiency-focused approach [97]. Education requires value-based, normative judgments that cannot be replaced by data-driven decision-making [13]. Any decision-making in education needs to inquire first about what is educationally desirable [13]. The first principle thus leads to audit criteria such as:
- What educational theories/values/didactic assumptions is the system/intervention based on?
- Are the educational objectives of the system/intervention defined?
- Does the system help to understand learning (process/effort/success/success factors) rather than just measuring it?
The second principle calls for seeing "students as agents" of their learning process, not only as recipients of learning interventions that generate data [97]. This implies the necessity of informed consent for the use of LA systems but also an inquiry into students' educational priorities and challenges [57]. The need for explainability, which can be deduced from the second principle, is closely related to the fifth (transparency) principle. The second principle leads to audit criteria such as:
- Did learners consent to the specific use of data?
- Are drivers of system results understandable to learners/instructors/institution?
- Are implications/actions from system results understandable to learners?
- Do learners get the chance to explain themselves/adapt/correct system output and/or implications?
The third principle describes "student identity and performance as temporal dynamic constructs" [97]. Even though some competencies can be frustratingly stable and slow to acquire, education mostly takes place in phases of life that are associated with rapid personal development and frequent changes of context and setting. This also implies that learners are in a critical phase of their life and must therefore be protected from discrimination and unfair treatment. For the use of data in LA, the following questions must be answered:
- Is the timespan of data used for analysis or prediction adequate?
- Is newer data weighted more strongly?
- Are system results fair and unbiased with regard to minority groups?
- Is the data used balanced with regard to data from minority groups?
- Will data/system outputs be deleted or anonymized after an appropriate timespan?
- Can learners request the deletion of data?
- Will past data on learning performance or process permanently limit learning opportunities?
The fourth principle states that "student success is a complex and multidimensional phenomenon" [97]. Digital LA systems are biased towards using digitally available data and ignoring important success factors of learning, because those factors are either not easily available (e.g., socio-demographic background) or not digitally captured (e.g., offline learning activities such as reading books or discussing with friends). An ethical review of an LA system, therefore, needs to address the following questions:
- How have the data points been chosen?
- Which data is potentially missing because it is not (digitally) available?
- Is it made transparent which dimensions of student effort/progress/success cannot be measured?
The fifth ethical principle in LA is the principle of transparency [97], also required by Pardo and Siemens [77]. It includes transparency about the purpose and conditions of data use and access to data and results [97]. The notion of transparency is closely related to explainability, which is covered by the second principle (learners as agents of their learning). Therefore, the following questions have to be answered to assess an LA system:
- Are learners informed about which data is processed by the system?
- Are learners informed about system objectives?
- Is system output shared with learners, or only with teachers or the institution?
In contrast to the previous five principles, which set limits for LA, the sixth principle, "higher education cannot afford to not use data", encourages the adoption of LA practices [97]. Data can and shall be used primarily for the benefit of learners, but also of instructors and educational institutions, to provide meaningful learning experiences and individualized learning support with limited resources. Data-based analytical approaches can also help to understand and mitigate biases that exist in society, e.g., by identifying underserved groups of students or discrimination in opportunities or results. Some authors argue that educational institutions should therefore strive to systematically collect data on minorities and analyze it to identify and prevent discrimination [9]. However, this contradicts the established principle of data minimization, which is one of the basic principles of data protection and is required by the European General Data Protection Regulation (GDPR).
This leads to the following audit question for an LA system:
- Based on the educational objectives and approaches of the institution, is LA used where needed?
6 Auditing AI in LA systems
The creation and continuous development of educational AI systems is a complex multi-step process, where every step can potentially introduce or amplify bias [9]. The creation of a robust and transparent system requires dedicated effort at every step of this process. In the following, a process for auditing the fairness, transparency, and robustness of AI applications in the LA domain will be presented (Fig. 2). While the terms auditing, assurance, and certification are all used in the literature, in the following only the term audit will be used. The proposed process integrates the ISO standard for auditing management systems [48] with prior works by [83, 101].
6.1 Delimitation and definition of scope
The first step of the audit process consists of the delimitation and definition of the scope of the system to be audited [71]. In many cases, the AI technology may be part of a broader system that is not completely subject to the audit. This includes delimiting the scope of the relevant system components, the relevant input data, and the output data of the system, and, if possible, identifying the processing methodology of the data within the system. In ISO 19011 [48], this step is part of establishing and implementing an audit program.
Identification and documentation of system scope
Before beginning the audit, it is essential to define the scope of the system to be audited, especially if the system is embedded in a more comprehensive system or scope. This includes defining the start and the end of the relevant processes. For this purpose, it is necessary to obtain access to documentation, process descriptions, and possibly source code. Traceability in the broadest sense is not a pre-requisite for all types of audits; audits with lower levels of assurance can be performed without full insight into data, code, and models. In many cases, it will be impossible to analyze such extensive information.
In the automotive industry, the scope of a system risk assessment is described as the Operational Design Domain (ODD) and describes the conditions under which a system is expected to be reliable, for example in terms of geography or weather [61]. In their audit of the LA functionality of the Moodle learning management system, Tagharobi and Simbeck [101] delimited the scope to only 35 thousand of a total of 2.6 million lines of code of the complete application.
Identification and documentation of input and output data
Software applications use databases to store and retrieve data. In complex applications, many variables of different types are stored in columns of interrelated tables in databases. Machine learning applications access those databases to feed data into the trained models and store the output back into the database. Not all data columns may be relevant to the scope of the audit. To assess the AI-based LA application, it is necessary to understand and document which variables are used as input to the system, which output is created by the system, and where it is stored. Usually, data needs to be pre-processed before it can be used for machine learning. The applied data pre-processing shall be documented, for example data cleaning or the creation of additional calculated or aggregated attributes. It is also important to understand and document when, where, and how the system output is displayed and to whom. A dropout prediction in a learning management system could, for instance, be initiated by an administrator, with the aggregated result visible to a teacher and the individual result visible to the learner. Apart from the aggregated and individual results, learners and teachers might receive further information about the quality of the prediction or be able to compare themselves to other groups.
Identification and documentation of processing methodology
A machine-learning-based AI system applies a trained model to new data. In some cases, the trained model may be built into the system; in other cases, users may be able to train their own models. For training the model, numerous approaches can be used, from linear regression to deep neural networks. In some cases, system users may be able to choose a model-building approach. Auditors should identify and document how the data is processed in the system and how the model is trained. If a pre-trained model is used, the available quality measures of the model and the training data used need to be documented.
6.2 Risk-based definition of audit criteria
Once the system to be audited has been delimited and access to relevant stakeholders, documentation, data, system, and possibly source code has been acquired, the audit objectives, risks, and audit criteria need to be identified and documented. This step is also part of establishing and implementing an audit program in [48].
Identification of risks
To systematically assess the risks of the system, auditors have to consider the perspective of all stakeholders and identify what could go wrong, starting with the consequences of the system output. Interviews with diverse stakeholders can serve to identify domain-specific risk areas and describe cases of potential negative consequences from system failure [76]. For medical software, risk analysis focuses on reliability (of intended functionality) and safety (avoidance of unintended harm) [20]. Other classical risk categories are damage to property or financial assets, injury, disclosure of personal data, and negative consequences from the non-availability or failure of the system.
To identify and describe domain-specific risks, the six principles of ethical LA as described by [97] can be used as a starting point (Fig. 1).
Risk-based definition of testing scenarios/edge cases/corner cases
In the next step, the identified risks can be translated into testable scenarios. Scenarios are combinations of input factors that may occur and influence the system output [107]. Testing scenarios should include both usual and frequent scenarios and unusual or unexpected ones. The technique of creating personas [80] helps to identify and cover typical system usage. Often, edge cases or corner cases are used that describe scenarios with low probabilities but possibly severe consequences [14, 40, 83]. Kitto and Knight [54] promote the use of edge cases for assessing the ethical implications of AI use in LA and discuss three example cases (prior consent, conflicting claims in collaborative learning, and profiling). Another approach to identifying relevant scenarios is the use of acceptance tests (compliance with requirements) and assurance cases (structured claims that can be proven based on evidence) [41, 84, 98].
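Edge cases identified in the risk analysis can be pinned down as executable acceptance tests. The sketch below uses pytest conventions; predict_dropout_risk is a hypothetical stand-in for the audited system's interface, and the encoded claims are examples of possible acceptance criteria.

```python
# Minimal sketch of edge-case acceptance tests, written in pytest style.
def predict_dropout_risk(events: list[dict]) -> float:
    """Hypothetical stand-in for the audited system's prediction interface."""
    if not events:
        return 0.5                      # no evidence yet -> neutral score
    days = sum(e.get("inactivity_days", 0) for e in events)
    return min(1.0, days / 100)

def test_new_learner_is_not_flagged_high_risk():
    # edge case: enrollment just happened, no log data exists yet
    assert predict_dropout_risk([]) < 0.8

def test_risk_score_is_bounded():
    # boundary value analysis: extreme inactivity must stay within [0, 1]
    assert 0.0 <= predict_dropout_risk([{"inactivity_days": 10_000}]) <= 1.0
```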
Mapping of risks to audit criteria
The risks identified in the prior steps need to be mapped to testable criteria. ISO 19011 [48] explains: “The audit criteria are used as a reference against which conformity is determined. These may include (…) applicable policies, processes, procedures, performance criteria including objectives, statutory and regulatory requirements, management system requirements, information regarding the context and the risks and opportunities as determined by the auditee (…), sector codes of conduct (…).”
6.3 Methodologies for auditing AI-based LA systems
In this section, an overview of auditing approaches for AI systems will be presented and discussed with regard to their applicability in an LA context. Not all audit approaches can and should be applied in all audits. Auditors need to select appropriate methodologies based on the identified risks, system complexity, access to system, code, and documentation, and their own competencies (Table 1).
Some of the auditing methodologies to be discussed require extensive access to confidential data, source code, and machine learning models ("review of datasets" and "code analysis and review of model quality"). These approaches allow for more detailed audits. Without such access, only the methodologies "review of system objectives, interventions, consequences" and "black box testing" are available. However, even with full access to system and data, the complexity of many AI systems often still renders them incomprehensible [81, 96]. Therefore, the risk analysis step is crucial to identify relevant risks and determine the appropriate audit methodology. Ultimately, it is up to the stakeholders to weigh the potential benefits of a system against the risks and decide whether or not to use it.
The auditing methodologies can be combined. In any case, auditors need to focus on the risks defined in the prior step during the audit and document their work. The documentation of the audit needs to include information on the auditing methodologies applied, why they were chosen, and which risks and benefits are associated with the choice of methodology. The audit documentation should also state under which conditions the audit would be reproducible.
Review of system objectives, interventions, and consequences
The first auditing methodology is to systematically review the system objectives, interventions, and consequences. This can be done using system/process documentation and websites, by trying out the system, and/or through interviews with system providers and users. If process documentation does not exist yet, it should be created during the audit in order to document every step that is taken using the system. A similar approach is described in [76] and has yielded interesting results.
The objectives, interventions, and consequences of the system have to be identified and documented. Here, objectives refer to the purpose of the system: why is the system used and what is supposed to be achieved using the system? The system objectives should be evaluated with regard to the identified risk areas. Unless aspects of fairness, transparency, and robustness are formulated as objectives, they are probably not achieved. Interviews with developers can help to understand whether different groups of learners (gender diversity, migration/education background, language, age, disabilities) were included in the development process.
Interventions are the activities performed (e.g., assigning learning resources, granting access to data about the learning process) or recommended (e.g., consultation sessions) by the system. The interventions of the system with regard to all stakeholders (learners, teachers, and the organization) should be reviewed. Screenshots of the system output can help to assess transparency. Possible audit criteria are:
- Are system results provided to learners or only to teachers?
- Are system results appropriately explained?
- Do learners/teachers receive the opportunity to overwrite results or opt out of the system?
- Are autonomy and control of learners respected?
Those interventions then result in consequences outside of the AI application. Interventions and consequences for all stakeholders should be considered. Consequences for learners could be the motivational effect of feedback or its absence, increased or decreased time spent learning, or a more efficient learning process. As a consequence of system recommendations about learners, teachers could be influenced in their grading decisions. Consequences should be discussed both with regard to the objectives of the system and with regard to unintended side effects. A system can only be fair if the intended objectives of the system are achieved through the interventions and result in the intended consequences.
The advantage of this auditing approach is that the holistic effect of the system is considered. However, it is not considered how this effect is created technically. Using only documentation and interviews to audit objectives, interventions, and consequences, the system can hardly be assessed with regard to robustness. Rare cases of dysfunctionality can probably not be identified.
Review of datasets
One important aspect of auditing these systems is to review the datasets used for training and applying machine learning models, because biased datasets can lead to biased machine learning models, with negative consequences for the learners and the educational institutions that use these systems. Auditors therefore need to understand the importance of reviewing datasets and the criteria they should consider when conducting such reviews. In this section, we will delve into this auditing methodology and explain why it is necessary for ensuring the fairness and accuracy of AI components in LA systems.
This approach requires data science capabilities and access to system developers and/or datasets.
The first area to be reviewed is data selection. There is a bias toward using data that is easily available, but datasets that are easily available are often not representative of the target population [93]. To assess how representative a dataset is, its properties need to be compared to the target group's properties, for example in terms of demographics. [22] have used administrative data (voter roll data) to assess the coverage of mobility data used for pandemic decision-making and found that vulnerable groups (elderly people, minorities) are underrepresented. Biased datasets will result in biased machine-learning models. Bias in the data used for training educational AI systems on the one hand reflects historical biases and norms in society ("girls underperform in science") as well as population distributions [9]. On the other hand, systems themselves influence behaviors, and the data created by systems is biased towards digitally observable behaviors, thus missing offline learning or offline social networks. In the pre-processing step, data needs to be cleaned and features selected, added, and calculated. This step can also be used to improve the fairness of the dataset. If subgroups are underrepresented in the training data, subgroup accuracy can be systematically higher or lower [86].
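One way to operationalize this representativeness comparison is a chi-square goodness-of-fit test against known population shares, as sketched below; the counts and shares are invented.

```python
# Minimal sketch of a representativeness check: compare group counts in the
# training data against known population shares with a chi-square
# goodness-of-fit test. Counts and shares are illustrative.
import numpy as np
from scipy.stats import chisquare

observed = np.array([620, 280, 100])              # group counts in the dataset
population_share = np.array([0.50, 0.35, 0.15])   # e.g. from enrollment statistics
expected = population_share * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.4f}")
# a small p-value indicates the dataset deviates from the target population
```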
These considerations result in the following audit criteria with regard to representativeness:
- Is the dataset used for training representative?
- Is the dataset used for training suitable for the desired application?
- Has the data been adjusted during pre-processing to reduce bias?
- Are the properties of the training/validation/test data similar?
- Are protected groups well represented in the data?
- Are rare cases well represented in the data?
The used datasets should also be reviewed with regard to privacy and transparency:
- Are learners/teachers/users informed that their data is used and for which purpose?
- Has informed consent been given to the use of the data for this specific purpose?
- Has the data been anonymized?
Especially in a learning context, measures of interest are not directly observable (e.g., competency levels, learning progress, quality of teaching) and must be operationalized through constructs of other variables. The ability to objectively measure learning behaviors, outcomes, and competencies is limited, leading to measurement, annotation, and documentation biases [9]. The fairness of an LA system's outcomes will be negatively affected if the constructs used do not meet the requirements of validity and reliability [50]. According to quantitative social science theory, validity means that a construct measures what it is intended to measure, whereas reliability refers to the notion of reproducibility and consistency. This raises the following audit questions (a reliability check is sketched after the list):
- Which constructs are used as input and output variables?
- How valid and reliable are those constructs?
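As referenced above, the reliability of a multi-item construct can be checked quantitatively, for example with Cronbach's alpha; the item scores in the following sketch are invented.

```python
# Minimal sketch of a reliability check for a construct built from several
# item scores: Cronbach's alpha over a respondents-by-items matrix.
import numpy as np

# rows = respondents, columns = items that jointly measure one construct
items = np.array([[3, 4, 3, 5],
                  [2, 2, 3, 2],
                  [4, 5, 4, 4],
                  [1, 2, 1, 2],
                  [5, 4, 5, 5]], dtype=float)

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
total_variance = items.sum(axis=1).var(ddof=1)     # variance of the sum scores
alpha = k / (k - 1) * (1 - item_variances / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")   # values near 1 indicate consistency
```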
The review of datasets can give helpful insights into the potential fairness of the system. However, the review of datasets alone does not allow for assessing the real-world consequences of unfair data. A more holistic picture will thus require this methodology to be combined with a review of system objectives, interventions, and consequences. The review of datasets will also not provide information about the transparency or robustness of the system.
Code audit and review of model quality
In this section, two interrelated technical approaches to system audits are discussed: code audits and reviews of model quality. These approaches provide auditors with essential tools and methods to assess the implementation, suitability, and quality of the software and machine learning models used in LA systems. A source code audit (also referred to as static code analysis) aims to verify the implementation and suitability of the software for its intended use, review conformance, and/or identify defects or vulnerabilities by systematically reading and understanding the source code [37, 45, 73]. Software products consist of hundreds of thousands of lines of interrelated code, created by teams of developers and difficult to comprehend. According to [45], an auditor can go through up to 200 lines per hour. Often, a code audit can be supported by specialized tools [64]. Code analysis is not only very time-consuming; it is also difficult to assess complex system results that emerge from the interplay of system and data [96].
The static code analysis is complemented by dynamic testing, where the software is systematically executed to identify issues, usually based on designed test cases [73]. Other forms of software reviews are management reviews (reviews of the software development process), technical reviews, inspections (visual examination), and walk-throughs [45, 73].
Code analysis is used, for example, in the HR tool audit by [109], where the auditors manually review Jupyter notebooks provided by the company. A code audit for an LA application is performed by [101]. Their proposed framework consists of the steps: definition of the scope of the audit, artifact collection and confinement of source code, mapping/description/prioritization of classes and functions, fairness assessment, and interpretation of results [101].
The review of model quality is closely related to the code audit, as it may include reviewing the source code for generating the model. The review can cover the data pre-processing, the split of data into training, validation, and test data, the selection of the machine learning approach and model (e.g., linear regression or neural networks), the optimization criteria, and the indicators selected for model performance evaluation.
Most fairness measures are based on the confusion matrix (true positives/false positives/true negatives/false negatives) that is used to assess model quality [105]. Commonly used measures are equality of opportunity, predictive equality, and predictive parity [105]. Those fairness measures are also relevant in a LA context [86].
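The group-wise rates behind these measures can be read directly off per-group confusion matrices, as in the following sketch with invented labels and predictions.

```python
# Minimal sketch deriving group-wise rates from confusion matrices; comparing
# these rates across groups underlies the fairness measures named above.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

for g in (0, 1):
    tn, fp, fn, tp = confusion_matrix(y_true[group == g],
                                      y_pred[group == g]).ravel()
    tpr = tp / (tp + fn)   # equality of opportunity compares TPR across groups
    fpr = fp / (fp + tn)   # predictive equality compares FPR across groups
    ppv = tp / (tp + fp)   # predictive parity compares precision across groups
    print(f"group {g}: TPR={tpr:.2f} FPR={fpr:.2f} PPV={ppv:.2f}")
```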
The review of the machine learning model helps to judge the potential transparency of the system, as some models (decision trees, regression) are easier to interpret than others since they yield interpretable weights or rules [62, 65]. For other models, feature importance measures can be calculated, or the models can be approximated by decision trees [62, 65]. Several toolkits are available to provide explainability, such as LIME [87], SHAP [63], or SAGE [22]. The application of those tools can, however, yield inconsistent results; the appropriate approach therefore needs to be selected carefully [99]. Further, explainability can only be judged from the perspective of the user [19]. Choosing a per se explainable model only increases explainability if the decision parameters are actually shown at the user interface level.
Black box testing
Several black box audit techniques can be used to systematically test the system using test input data. Those include equivalence partitioning, boundary value analysis, cause-effect graphing, error-guessing [73], assurance cases, acceptance tests [41], or the generation of adversarial examples [100].
Synthetic data can be created to test model properties [42], especially fairness and robustness. In their audit of the pymetrics hiring tool, [109] systematically generated training data to explore scenarios that could circumvent the tool's de-biasing mechanism. Synthetic data can be especially useful in data-protection-sensitive fields such as LA [12].
Independently of the data used and the technical implementation of the model and system, acceptance tests and assurance cases can be used to systematically compare system functionality to expectations [41]. In acceptance test-driven development (ATDD), concrete, unambiguous, observable acceptance criteria are specified to describe the wanted behavior of the system [41]. While ATDD is a software engineering paradigm, the concept of assurance cases is used in safety and security norms such as ISO 26262. An assurance case can be defined as "a structured argument, supported by evidence, intended to justify that a system is acceptably assured relative to a concern (such as safety or security) in the intended operating environment" [78]. To apply acceptance tests and assurance cases to audit fairness, the audit criteria identified in the risk analysis (Sect. 6.2) need to be substantiated in acceptance tests and assurance cases [41].
Adversarial examples are data points (such as images or voice inputs) that are not correctly classified by different machine learning models [100]. Those data points differ only slightly, often imperceptibly to humans, from correctly classified examples [38, 100]. Adversarial examples can be used to attack machine learning models but also to assess the robustness of models [58]. Adversarial testing has also been used to generate test cases for autonomous driving [104] or for security tests [15].
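Where model gradients are not accessible to the auditor, even a crude random search can serve as a rough adversarial probe, as in the following sketch on an illustrative model; gradient-based attacks such as FGSM would be the standard choice for differentiable models.

```python
# Minimal sketch of a random-search adversarial probe: look for a small
# perturbation within an epsilon ball that flips the prediction.
# Model and data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0]) > 0).astype(int)
model = LogisticRegression().fit(X, y)

x0 = X[0]
label0 = model.predict([x0])[0]
epsilon = 0.3
for _ in range(1000):
    x_adv = x0 + rng.uniform(-epsilon, epsilon, size=x0.shape)
    if model.predict([x_adv])[0] != label0:
        print("prediction flipped by perturbation:", np.round(x_adv - x0, 3))
        break
else:
    print(f"no flip found within epsilon={epsilon}")
```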
6.4 Monitoring and re-assurance after system changes
After deployment, machine learning models need to be monitored, as models may degrade over time and may need retraining based on new or current data. After such system changes, re-assurance may be required [20, 33]. The proposed European AI Act [31, 69] requires providers of AI systems to conduct post-market monitoring to ensure ongoing compliance.
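As a minimal illustration, monitoring can compare the current distribution of model scores against a reference window recorded at audit time, here with a two-sample Kolmogorov-Smirnov test on synthetic scores.

```python
# Minimal sketch of post-deployment monitoring: flag drift when the current
# score distribution deviates from a reference window. Scores are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
reference_scores = rng.beta(2, 5, size=1000)    # scores at audit time
current_scores = rng.beta(2.6, 5, size=1000)    # scores in the current window

stat, p_value = ks_2samp(reference_scores, current_scores)
if p_value < 0.01:
    print(f"distribution drift detected (KS={stat:.3f}); consider re-audit")
else:
    print("no significant drift detected")
```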
7 Conclusion
With the growing use of AI in education and the public discussion around ethical AI systems, educational institutions adopting AI technologies will feel the need to assure their fairness, transparency, and robustness. Legislators in several countries are starting to regulate AI [18, 31, 49]. The possible AI regulation in Europe, with its conformity assessment requirement, might set standards well beyond Europe. Against this background, this article provides an overview of how AI can be audited, specifically in the context of LA.
It is proposed to derive domain-specific audit criteria for AI applications in LA systems from the six principles of ethical LA systems [97]. Using this approach, the learner is put at the center of the risk analysis, as any LA system shall benefit learners and not put them at risk. It is further proposed that the audit process for AI applications in LA systems comprise the four phases discussed: delimitation, risk-based definition of audit criteria, auditing and assessment, and monitoring and re-assurance.
Several methodologies can be applied for conducting the audit, depending on the risks identified; on the access to stakeholders, source code, documentation, and data; and on the capabilities of the reviewers. The auditing methodologies discussed are clustered into:
- review of system objectives, interventions, and consequences,
- review of datasets,
- code analysis and review of model quality, and
- technical black box testing.
The auditing of AI systems is an emerging topic. There will be a need to qualify not only auditors but also instructors and educational administration and management to assess educational AI systems. Apart from assessing systems, however, educational institutions may also choose to apply other risk mitigation measures, such as contractual guarantees from data or software suppliers, service level agreements, or the use of more transparent infrastructure (e.g., open source systems).
Providers, buyers, and users of educational AI systems will have to start reflecting on the ethical dimensions and implications of their systems, especially concerning fairness, transparency, and robustness, and to jointly discuss how assurance of those properties will be possible in the future.
Change history
24 May 2023
A Correction to this paper has been published: https://doi.org/10.1007/s43681-023-00301-9
References
Agarwal, A., Agarwal, H., Agarwal, N.: Fairness Score and process standardization: framework for fairness certification in artificial intelligence systems. AI Ethics (2022). https://doi.org/10.1007/s43681-022-00147-7
Alam, A.: Should robots replace teachers? Mobilisation of AI and learning analytics in education. In 2021 International Conference on Advances in Computing, Communication, and Control (ICAC3). IEEE, 1–12 (2021). https://doi.org/10.1109/ICAC353642.2021.9697300
Alblawi, A. S., Ahmad, A. A.: Big data and learning analytics in higher education: Demystifying variety, acquisition, storage, NLP and analytics. In 2017 IEEE Conference on Big Data and Analytics (ICBDA). IEEE, 124–129 (2017). https://doi.org/10.1109/ICBDAA.2017.8284118
Aldowah, H., Al-Samarraie, H., Fauzy, W.M.: Educational data mining and learning analytics for 21st century higher education: A review and synthesis. Telemat Inform. 37, 13–49 (2019). https://doi.org/10.1016/j.tele.2019.01.007
Ali, M., Sapiezynski, P., Bogen, M., Korolova, A., Mislove, A., Rieke, A.: Discrimination through Optimization. Proc. ACM Hum.-Comput. Interact. 3, CSCW, 1–30 (2019). https://doi.org/10.1145/3359301
Arya, V., Bellamy, R. K. E., Chen, P. Y., Dhurandhar, A., Hind, M., Hoffman, S. C., Houde, S., Liao, Q. V., Luss, R., Mojsilović, A., Mourad, S., Pedemonte, P., Raghavendra, R., Richards, J., Sattigeri, P., Shanmugam, K., Singh, M., Varshney, K. R., Wei, D., Zhang, Y.: One explanation does not fit all: a toolkit and taxonomy of AI explainability techniques. (2019). https://doi.org/10.48550/arXiv.1909.03012
BaFin: Maschinelles Lernen in Risikomodellen – Charakteristika und aufsichtliche Schwerpunkte (Konsultationspapier) (2021). Retrieved 7.4.22 from https://www.bundesbank.de/resource/blob/670944/dc2910d45779a682010ddd125ed66056/mL/2021-07-15-ml-konsultation-data.pdf
BaFin: Maschinelles Lernen in Risikomodellen – Charakteristika und aufsichtliche Schwerpunkte. Antworten auf das Konsultationspapier (2022). Retrieved 7.4.22 from https://www.bundesbank.de/resource/blob/832120/098e427a1944db71a90afc0d46781172/mL/2022-02-18-ml-konsultation-ergebnisse-data.pdf
Baker, R.S., Hawn, A.: Algorithmic bias in education. Int J Artif Intell Educ (2021). https://doi.org/10.1007/s40593-021-00285-9
Baker, R. S., Martin, T., Rossi, L.M.: Educational Data Mining and Learning Analytics. In The Handbook of Cognition and Assessment, André A. Rupp and Jacqueline P. Leighton, Eds. John Wiley & Sons, Inc, Hoboken, NJ, USA, 379–396 (2019). https://doi.org/10.1002/9781118956588.ch16
Barocas, S., Selbst, A.D.: Big data’s disparate impact. Calif. L. Rev. 104, 671 (2016)
Berg, A. M., Mol, S. T., Kismihók, G., Sclater, N.: The role of a reference synthetic data generator within the field of learning analytics. Learning Analytics 3, 1 (2016). https://doi.org/10.18608/jla.2016.31.7
Biesta, G.: Why “What Works” Won’t Work: evidence-based practice and the democratic deficit in educational research. Educ. Theory 57(1), 1–22 (2007). https://doi.org/10.1111/j.1741-5446.2006.00241.x
Bolte, J. A., Bar, A., Lipinski, D., Fingscheidt, T.: Towards Corner Case Detection for Autonomous Driving. In 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 438–445 (2019). https://doi.org/10.1109/IVS.2019.8813817
Brubaker, C., Jana, S., Ray, B., Khurshid, S., Shmatikov, V.: Using Frankencerts for Automated Adversarial Testing of Certificate Validation in SSL/TLS Implementations. In 2014 IEEE Symposium on Security and Privacy. IEEE, 114–129 (2014). https://doi.org/10.1109/SP.2014.15
BSI: AI Cloud Service Compliance Criteria Catalogue (AIC4) (2021). Retrieved 12.4.22 from https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/CloudComputing/AIC4/AI-Cloud-Service-Compliance-Criteria-Catalogue_AIC4.pdf
Buckingham Shum, S., Luckin, R.: Learning analytics and AI: Politics, pedagogy and practices. Br. J. Edu. Technol. 50(6), 2785–2793 (2019). https://doi.org/10.1111/bjet.12880
China: Internet Information Service Algorithm Recommendation Management Regulations (2022). Retrieved January 19, 2022 (via Google Translate) from http://www.cac.gov.cn/2022-01/04/c_1642894606364259.htm
Chitti, M., Chitti, P., Jayabalan, M.: Need for Interpretable Student Performance Prediction. In 2020 13th International Conference on Developments in eSystems Engineering (DeSE). IEEE, 269–272 (2020). https://doi.org/10.1109/DeSE51703.2020.9450735
Cooper, J.G., Pauley, K.A.: Healthcare Software Assurance. AMIA Ann. Symp. Proc. 2006, 166–170 (2006)
Coston, A., Guha, N., Ouyang, D., Lu, L., Chouldechova, A., Ho, D. E.: Leveraging Administrative Data for Bias Audits. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM, New York, NY, USA, 173–184 (2021). https://doi.org/10.1145/3442188.3445881
Covert, I., Lundberg, S.M., Lee, S.: Understanding global feature contributions with additive importance measures. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), online, 6–12 December 2020, Hugo Larochelle, Ed. Curran Associates Inc, Red Hook, NY (2020)
Cyberspace Administration of China: The State Internet Information Office and other four departments issued the "Internet Information Service Algorithm Recommendation Management Regulations" (2022). Retrieved January 19, 2022 (via Google Translate) from http://www.cac.gov.cn/2022-01/04/c_1642894606258238.htm
Dahm, M., Dregger, A.: Der Einsatz von künstlicher Intelligenz im HR: Die Wirkung und Förderung der Akzeptanz von KI-basierten Recruiting-Tools bei potenziellen Nutzern. In Arbeitswelten der Zukunft: Wie die Digitalisierung unsere Arbeitsplätze und Arbeitsweisen verändert, Burghard Hermeier, Thomas Heupel and Sabine Fichtner-Rosada, Eds. Springer Fachmedien Wiesbaden, Wiesbaden, 249–271 (2019). https://doi.org/10.1007/978-3-658-23397-6_14
Deho, O.B., Zhan, C., Li, J., Liu, J., Liu, L., Le, T.D.: How do the existing fairness metrics and unfairness mitigation algorithms contribute to ethical learning analytics? Br. J. Edu. Technol. (2022). https://doi.org/10.1111/bjet.13217
DIN: DIN SPEC 92001-1:2019-04, Künstliche Intelligenz - Life Cycle Prozesse und Qualitätsanforderungen - Teil 1: Qualitäts-Meta-Modell; Text Englisch. Beuth Verlag GmbH, Berlin (2019)
Drachsler, H., Greller, W.: Privacy and analytics: it's a DELICATE issue. A checklist for trusted learning analytics. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge - LAK '16. ACM Press, New York, New York, USA, 89–98 (2016). https://doi.org/10.1145/2883851.2883893
Duval, E.: Attention please! In Proceedings of the 1st International Conference on Learning Analytics and Knowledge. ACM, New York, NY, USA, 9–17 (2011). https://doi.org/10.1145/2090116.2090118
European Commission.: EU 2017/589 Commission Delegated Regulation (EU) 2017/589 of 19 July 2016 supplementing Directive 2014/65/EU of the European Parliament and of the Council with regard to regulatory technical standards specifying the organisational requirements of investment firms engaged in algorithmic trading (2017)
European Commission: Ethics Guidelines for trustworthy AI (2019). Retrieved 8.4.22 from https://ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines.1.html
European Commission: Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) And Amending Certain Union Legislative Acts (2021)
European Commission: The 2023 annual Union work programme for European standardisation (2023)
FDA: General Principles of Software Validation; Final Guidance for Industry and FDA Staff (2002). Retrieved March 23, 2022 from https://www.fda.gov/media/73141/download
Feldman, M., Friedler, S., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY (2015)
Ferguson, R.: Ethical Challenges for Learning Analytics. Learning Analytics 6, 3 (2019). https://doi.org/10.18608/jla.2019.63.5
Floridi, L., Holweg, M., Taddeo, M., Silva, J.A., Mökander, J., Wen, Y.: capAI - A procedure for conducting conformity assessment of AI systems in line with the EU artificial intelligence Act. SSRN J (2022). https://doi.org/10.2139/ssrn.4064091
Gomes, I., Morgado, P., Gomes, T., Moreira, R.: An overview on the Static Code Analysis approach in Software Development (2009)
Goodfellow, I. J., Shlens, J., Szegedy, C.: Explaining and Harnessing Adversarial Examples (2014)
Hagendorff, T.: The ethics of ai ethics: an evaluation of guidelines. Mind. Mach. 30(1), 99–120 (2020). https://doi.org/10.1007/s11023-020-09517-8
Hall, B., Driscoll, K.: Distributed System Design Checklist. (2014) NASA/CR–2014–218504
Hauer, M. P., Adler, R., Zweig, K.: Assuring Fairness of Algorithmic Decision Making. In 2021 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 110–113 (2021). https://doi.org/10.1109/ICSTW52544.2021.00029
Hittmeir, M., Ekelhart, A., Mayer, R.: On the utility of synthetic data. In Proceedings of the 14th International Conference on Availability, Reliability and Security. ACM, New York, NY, USA, 1–6 (2019) https://doi.org/10.1145/3339252.3339281
Holmes, W., Porayska-Pomsta, K., Holstein, K., Sutherland, E., Baker, T., Shum, S.B., Santos, O.C., Rodrigo, M.T., Cukurova, M., Bittencourt, I.I., Koedinger, K.R.: Ethics of AI in education: towards a community-wide framework. Int J Artif Intell Educ 32(3), 504–526 (2022). https://doi.org/10.1007/s40593-021-00239-1
IDW: Entwurf eines IDW Prüfungsstandards: Prüfung von KI-Systemen (IDW EPS 861 (02.2022)) (2022). Retrieved from https://www.idw.de/blob/134852/bf9349774314723f6246ba73fefc491f/idw-eps-861-02-2022-data.pdf
IEEE: IEEE Standard for Software Reviews and Audits, IEEE Std 1028. IEEE, Piscataway, NJ, USA
IG-NB: Fragenkatalog „Künstliche Intelligenz bei Medizinprodukten“ Version 3 (2021). Retrieved 12.4.22 from https://www.ig-nb.de/dok_view?oid=861877
Imana, B., Korolova, A., Heidemann, J.: Auditing for Discrimination in Algorithms Delivering Job Ads. In Proceedings of the Web Conference 2021. ACM, New York, NY, USA, 3767–3778 (2021). https://doi.org/10.1145/3442381.3450077
ISO: Guidelines for auditing management systems, 19011:2018
ITEG: Gesetz über die Möglichkeit des Einsatzes von datengetriebenen Informationstechnologien bei öffentlich-rechtlicher Verwaltungstätigkeit (IT-Einsatz-Gesetz – ITEG). Schleswig-Holsteinischer Landtag (2022)
Jacobs, A.Z., Wallach, H.: Measurement and Fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM, New York, NY, USA, 375–385 (2021). https://doi.org/10.1145/3442188.3445901
Kamiran, F., Calders, T.: Data preprocessing techniques for classification without discrimination. Knowl Inf Syst 33, 1–33 (2012)
Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-Aware Classifier with Prejudice Remover Regularizer. In Machine learning and knowledge discovery in databases. European conference, ECML PKDD 2012 proceedings, part II. Lecture notes in computer science Lecture notes in artificial intelligence, 7524. Springer, 35–50 (2012). https://doi.org/10.1007/978-3-642-33486-3_3
Karizat, N., Delmonaco, D., Eslami, M., Andalibi, N.: Algorithmic Folk Theories and Identity: How TikTok Users Co-Produce Knowledge of Identity and Engage in Algorithmic Resistance. Proc. ACM Hum.-Comput. Interact. 5, CSCW2, 1–44 (2021). https://doi.org/10.1145/3476046
Kitto, K., Knight, S.: Practical ethics for building learning analytics. Br. J. Edu. Technol. 50(6), 2855–2870 (2019). https://doi.org/10.1111/bjet.12868
Kizilcec, R. F., Lee, H.: Algorithmic Fairness in Education (2020). https://doi.org/10.48550/arXiv.2007.05443
Krafft, T. D., Reber, M., Krafft, R., Coutrier, A., Zweig, K. A.: Crucial Challenges in Large-Scale Black Box Analyses. In Advances in Bias and Fairness in Information Retrieval, Ludovico Boratto, Stefano Faralli, Mirko Marras and Giovanni Stilo, Eds. Communications in Computer and Information Science. Springer International Publishing, Cham, 143–155 (2021). https://doi.org/10.1007/978-3-030-78818-6_13
Kruse, A., Pongsajapan, R.: Student-centered learning analytics (2012)
Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial Machine Learning at Scale (2016)
Lang, C., Siemens, G., Wise, A. F., Gašević, D., Merceron, A. Eds.: The Handbook of Learning Analytics (2022)
Larrabee Sønderlund, A., Hughes, E., Smith, J.: The efficacy of learning analytics interventions in higher education: a systematic review. Br. J. Edu. Technol. 50(5), 2594–2618 (2019). https://doi.org/10.1111/bjet.12720
Lee, C.W., Nayeer, N., Garcia, D.E., Agrawal, A., Liu, B.: Identifying the operational design domain for an automated driving system through assessed risk. In: 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1317–1322. IEEE, 19 Oct 2020. https://doi.org/10.1109/IV47402.2020.9304552
Liao, Q. V., Gruen, D., Miller, S.: Questioning the AI: Informing Design Practices for Explainable AI User Experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, 1–15 (2020). https://doi.org/10.1145/3313831.3376590
Lundberg, S. M., Lee, S. I.: A unified approach to interpreting model predictions. In Advances in neural information processing systems 30. 31st Annual Conference on Neural Information Processing Systems (NIPS 2017) : Long Beach, California, USA, 4–9 December 2017, Ulrike v. Luxburg, Isabelle Guyon, Samy Bengio, Hanna Wallach, Rob Fergus, S. V. N. Vishwanathan and Roman Garnett, Eds. Curran Associates Inc, Red Hook, NY.
Mantere, M., Uusitalo, I., Roning, J.: Comparison of Static Code Analysis Tools. In 2009 Third International Conference on Emerging Security Information, Systems and Technologies. IEEE, 15–22 (2009) https://doi.org/10.1109/SECURWARE.2009.10
Markus, A.F., Kors, J.A., Rijnbeek, P.R.: The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies. J Biomed Inform 113, 103655 (2021). https://doi.org/10.1016/j.jbi.2020.103655
Matcha, W., Uzir, N.A., Gasevic, D., Pardo, A.: A systematic review of empirical studies on learning analytics dashboards: a self-regulated learning perspective. IEEE Trans. Learn. Technol. 13(2), 226–245 (2020). https://doi.org/10.1109/TLT.2019.2916802
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on Bias and fairness in machine learning. ACM Comput. Surv. 54(6), 1–35 (2021). https://doi.org/10.1145/3457607
Meske, C., Bunde, E., Schneider, J., Gersch, M.: Explainable artificial intelligence: objectives, stakeholders, and future research opportunities. Inf. Syst. Manag. 39(1), 53–63 (2022). https://doi.org/10.1080/10580530.2020.1849465
Mökander, J., Axente, M., Casolari, F., Floridi, L.: Conformity assessments and post-market monitoring: a guide to the role of auditing in the proposed European AI regulation. Mind. Mach. 32(2), 241–268 (2022). https://doi.org/10.1007/s11023-021-09577-4
Mökander, J., Floridi, L.: Ethics-based auditing to develop trustworthy AI. Mind. Mach. 31(2), 323–327 (2021). https://doi.org/10.1007/s11023-021-09557-8
Mökander, J., Floridi, L.: Operationalising AI governance through ethics-based auditing: an industry case study. AI Ethics, 1–18 (2022). https://doi.org/10.1007/s43681-022-00171-7
Mora-Cantallops, M., Sánchez-Alonso, S., García-Barriocanal, E., Sicilia, M.-A.: Traceability for trustworthy AI: a review of models and tools. BDCC 5(2), 20 (2021). https://doi.org/10.3390/bdcc5020020
Myers, G.J., Badgett, T., Sandler, C.: The Art of Software Testing, 3rd edn. Wiley, Hoboken, NJ (2012)
Namoun, A., Alshanqiti, A.: Predicting student performance using data mining and learning analytics techniques: a systematic literature review. Appl. Sci. 11(1), 237 (2021). https://doi.org/10.3390/app11010237
National Institute of Standards and Technology: AI Risk Management Framework. Gaithersburg, MD (2023)
ORCAA: Description of Algorithmic Audit: Pre-built Assessments (2020)
Pardo, A., Siemens, G.: Ethical and privacy principles for learning analytics. Br. J. Edu. Technol. 45(3), 438–450 (2014). https://doi.org/10.1111/bjet.12152
Piovesan, A., Griffor, E.: Reasoning About Safety and Security. In Handbook of System Safety and Security. Elsevier, 113–129 (2017). https://doi.org/10.1016/B978-0-12-803773-7.00007-3
Poretschkin, M., Schmitz, A., Akila, M., Adilova, L., Becker, D., Cremers, A. B., Hecker, D., Houben, S., Mock, M., Rosenzweig, J., Sicking, J., Schulz, E., Voss, A., Wrobel, S.: Leitfaden zur Gestaltung vertrauenswürdiger Künstlicher Intelligenz. KI-Prüfkatalog (2021)
Pruitt, J., Grudin, J.: Personas. In Proceedings of the 2003 conference on Designing for user experiences - DUX '03. ACM Press, New York, New York, USA, 1 (2003). https://doi.org/10.1145/997078.997089
Rahwan, I.: Society-in-the-loop: programming the algorithmic social contract. Ethics Inf Technol 20(1), 5–14 (2018). https://doi.org/10.1007/s10676-017-9430-8
Rai, N.: Why ethical audit matters in artificial intelligence? AI Ethics 2(1), 209–218 (2022). https://doi.org/10.1007/s43681-021-00100-0
Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., Barnes, P.: Closing the AI accountability gap. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. ACM, New York, NY, USA, 33–44 (2020). https://doi.org/10.1145/3351095.3372873
Rhodes, T., Boland, F., Fong, E., Kass, M.: Software assurance using structured assurance case models. J. Res. Nat. Inst. Stand. Technol. 115(3), 209–216 (2010). https://doi.org/10.6028/jres.115.013
Riazy, S., Simbeck, K., Schreck, V.: Fairness in Learning Analytics: Student At-risk Prediction in Virtual Learning Environments. In Proceedings of the 12th International Conference on Computer Supported Education. SCITEPRESS - Science and Technology Publications, 15–25 (2020). https://doi.org/10.5220/0009324100150025
Riazy, S., Simbeck, K., Schreck, V.: Systematic Literature Review of Fairness in Learning Analytics and Application of Insights in a Case Study. In Computer Supported Education, H. C. Lane, Susan Zvacek and James Uhomoibhi, Eds. Communications in Computer and Information Science. Springer International Publishing, Cham, 430–449 (2021). https://doi.org/10.1007/978-3-030-86439-2_22
Ribeiro, M. T., Singh, S., Guestrin, C.: "Why Should I Trust You?". In KDD'16: Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery Inc. (ACM), New York, NY, 1135–1144 (2016). https://doi.org/10.1145/2939672.2939778
Rienties, B., Simonsen, H.K., Herodotou, C.: Defining the Boundaries Between Artificial Intelligence in Education, Computer-Supported Collaborative Learning, Educational Data Mining, and Learning Analytics: A Need for Coherence. Front Educ (2020). https://doi.org/10.3389/feduc.2020.00128
Romero, C., Ventura, S.: Educational data mining and learning analytics: an updated survey. WIREs Data Mining Knowl Discov 10, 3 (2020). https://doi.org/10.1002/widm.1355
Rzepka, N., Simbeck, K., Müller, H-G., Pinkwart, N.: Fairness of in-session dropout prediction. In: Proceedings of the 14th International Conference on Computer Supported Education. SCITEPRESS - Science and Technology Publications, pp. 316–326 (2022). https://doi.org/10.5220/0010962100003182
Salas-Pilco, S.Z., Xiao, K., Hu, X.: Artificial intelligence and learning analytics in teacher education: a systematic review. Education Sciences 12(8), 569 (2022). https://doi.org/10.3390/educsci12080569
Sandvig, C., Hamilton, K., Karahalios, K., Langbort, C.: Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry, 4349–4357 (2014)
Schwartz, R., Vassilev, A., Greene, K., Perine, L., Burt, A., Hall, P.: Towards a standard for identifying and managing bias in artificial intelligence. https://doi.org/10.6028/NIST.SP.1270.
Shen, H., DeVos, A., Eslami, M., Holstein, K.: Everyday algorithm auditing: understanding the power of everyday users in surfacing harmful algorithmic behaviors. Proc. ACM Hum.-Comput. Interact. 5, CSCW2, 1–29 (2021). https://doi.org/10.1145/3479577.
Shneiderman, B.: Human-Centered Artificial Intelligence: Three Fresh Ideas. THCI , 109–124 (2020). https://doi.org/10.17705/1thci.00131
Shook, J., Smith, R., Antonio, A.: Symposium Edition - artificial intelligence and the legal profession. Tex. A&M J. Prop. L. 4, 5, 443–463 (2018). https://doi.org/10.37419/JPL.V4.I5.2
Slade, S., Prinsloo, P.: Learning analytics: Ethical issues and dilemmas. Am. Behav. Sci. 57(10), 1510–1529 (2013). https://doi.org/10.1177/0002764213479366
Smara, M., Aliouat, M., Pathan, A.-S., Aliouat, Z.: Acceptance test for fault detection in component-based cloud computing and systems. Futur. Gener. Comput. Syst. 70, 74–93 (2017). https://doi.org/10.1016/j.future.2016.06.030
Swamy, V., Radmehr, B., Krco, N., Marras, M., Käser, T.: Evaluating the Explainers: Black-Box Explainable Machine Learning for Student Success Prediction in MOOCs. arXiv (2022)
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. https://doi.org/10.48550/arXiv.1312.6199
Tagharobi, H., Simbeck, K.: Introducing a framework for code based fairness audits of learning analytics systems on the example of Moodle learning analytics. In: Proceedings of the 14th International Conference on Computer Supported Education. SCITEPRESS - Science and Technology Publications, pp. 45–55 (2022). https://doi.org/10.5220/0010998900003182
The White House: Blueprint for an AI Bill of Rights. Making Automated Systems Work for the American People. Retrieved April 18, 2023 from https://www.whitehouse.gov/ostp/ai-bill-of-rights/
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness May Be at Odds with Accuracy (2018). https://doi.org/10.48550/arXiv.1805.12152
Tuncali, C.E., Fainekos, G., Ito, H., Kapinski, J.: Sim-ATAV. In Proceedings of the 21st International Conference on Hybrid Systems: Computation and Control (part of CPS Week). ACM, New York, NY, USA, 283–284 (2018). https://doi.org/10.1145/3178126.3187004
Verma, S., Rubin, J.: Fairness definitions explained. In Proceedings of the International Workshop on Software Fairness - FairWare '18. ACM Press, New York, New York, USA, 1–7 (2018). https://doi.org/10.1145/3194770.3194776
Vokinger, K.N., Feuerriegel, S., Kesselheim, A.S.: Continual learning in medical devices: FDA’s action plan and beyond. The Lancet Digital Health 3(6), e337–e338 (2021). https://doi.org/10.1016/S2589-7500(21)00076-5
Weidenhaupt, K., Pohl, K., Jarke, M., Haumer, P.: Scenarios in system development: current practice. IEEE Softw. 15(2), 34–45 (1998). https://doi.org/10.1109/52.663783
Williams, B., Shmargad, Y.: How algorithms discriminate based on data they lack: challenges, solutions, and policy implications. J. Inf. Policy 8, 78 (2018). https://doi.org/10.5325/jinfopoli.8.2018.0078
Wilson, C., Ghosh, A., Jiang, S., Mislove, A., Baker,L., Szary, J., Trindel, K., Polli, F.: Building and Auditing Fair Algorithms. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM, New York, NY, USA, 666–677 (2021). https://doi.org/10.1145/3442188.3445928
Xu, H., Mannor, S.: Robustness and generalization. Mach. Learn. 86(3), 391–423 (2012). https://doi.org/10.1007/s10994-011-5268-1
Funding
Open Access funding enabled and organized by Projekt DEAL. This research has been funded by the German Federal Ministry of Education and Research under grant number 16DHB4002.
Ethics declarations
Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.