1 Introduction

Artificial intelligence (AI), which refers to both a research field and a set of technologies, is rapidly growing and has already spread to application areas ranging from policing to healthcare and transport (e.g. Rezende, 2020; Stilgoe, 2018; Trocin et al., 2021). The growth in AI applications is set to continue in the near term, and in the long term, AI technologies can transform areas such as scientific methods, foreign policy, and personalized medicine (Tewari, 2022). In general, AI is integrated into information systems and refers to capabilities for data interpretation, learning, and adaptation that aim to attain human-level performance in particular tasks (Kaplan & Haenlein, 2019; Russell & Norvig, 2021). In some cases—for example, the optimization of online search results and the filtering of social media feeds—AI has already become commonplace and nearly invisible.

Increasing AI capabilities and applications bring novel risks and potential harms for individuals and societies, such as a lack of transparency and accountability and biases against individuals and groups (Dignum, 2020; Floridi et al., 2018; Martin, 2019). These challenges and risks underscore the importance of AI governance at the organizational, interorganizational, and societal levels (Laato et al., 2022; Mäntymäki et al., 2022a, b; Minkkinen et al., 2022a, b; Schneider et al., 2022; Seppälä et al., 2021). As a closely related parallel to governance, auditing of AI is promoted as a means of tackling risks by holding AI systems and the organizations that use them to certain criteria and by requiring the necessary controls (Koshiyama et al., 2021; Minkkinen et al., 2022a, b; Mökander et al., 2021; Sandvig et al., 2014). Beyond tackling risks, auditing of AI has also been promoted as a new industry and a source of economic growth (Koshiyama et al., 2021). Nevertheless, auditing faces challenges owing to the nature of some AI technologies. Traditionally, auditing has been conducted periodically or cyclically, in which case audits represent snapshots of systems and processes. In snapshot audits, timing is crucial because an early audit can influence an AI system’s design and operations more than a post-deployment audit of a production system can (Raji et al., 2020b; cf. Laato et al., 2022a, b). Whilst many AI systems use fairly static models with periodic updates, some systems, such as those based on reinforcement learning, adapt continuously and rely on highly complex models, which means that they may produce unpredictable results (Dignum, 2020; Falco et al., 2021; Kaplan & Haenlein, 2019). Learning and adaptation bring benefits but also potential risks, as AI systems learn patterns that are not hard-coded by designers. Adaptation poses a specific challenge for snapshot auditing because a system that is deemed compliant at one point may not be compliant later. In addition, AI systems operate and evolve much faster than human-led snapshot auditing processes, which are usually relatively cumbersome.

As the challenges of snapshot audits were apparent well before the recent growth of AI adoption, the continuous auditing (CA) concept was introduced in 1989 (Groomer & Murthy, 1989) in response to the need for near-real-time auditing information. Auditing of AI and CA are a natural match because CA can potentially keep pace with an AI system’s evolution and continuously provide up-to-date information on its performance against set criteria. The rationale for CA is also linked to the aspiration of human oversight of AI systems (Floridi et al., 2018; Shneiderman, 2020). On the one hand, CA may challenge human agency by transferring part of auditing to machines; on the other hand, it may free human capacity for higher-level auditing tasks. Provisionally, CA of AI systems appears most relevant to organizations’ internal audit functions (cf. Raji et al., 2020b; Tronto et al., 2021), as opposed to external auditing conducted by independent auditors, although this may change as the audit ecosystem continues to evolve (Mökander et al., 2022).

The potential of continuous AI auditing approaches has already been noted by the European Union (EU), whose proposed AI Act (European Commission, 2021) includes provisions for the mandatory post-market monitoring of high-risk AI systems. Under the proposed EU regulation, the providers of high-risk AI systems would need to draft post-market monitoring plans to document the performance of these systems throughout their life cycles after they are placed on the market (Mökander et al., 2022). However, although CA is a mature concept (e.g. Eulerich & Kalinichenko, 2018; Shiue et al., 2021; Vasarhelyi & Halper, 1991), we were unable to find an established literature stream specifically on the CA of AI (CAAI) beyond general calls for monitoring the impacts of algorithmic systems (e.g. Doneda & Almeida, 2016; Metcalf et al., 2021; Shah, 2018; Yeung et al., 2020).

To address the paucity of CAAI literature, this study answers the following research question: What is continuous auditing of artificial intelligence, and what frameworks and tools exist for its execution? The current paper advances the body of knowledge on auditing of AI (Brown et al., 2021; Koshiyama et al., 2021; Mökander et al., 2021; Sandvig et al., 2014) in two ways. First, we connect the research on auditing of AI and CA, introducing the CAAI concept. Second, we assess the suitability of existing AI auditing frameworks and tools for CAAI, adopting a bottom-up approach that starts from the tools and methods themselves. By conceptualizing CAAI and surveying frameworks and tools, this study lays the foundation for continued research and practical applications within the field of CAAI.

The remainder of the paper is structured as follows. In Sect. 2, we introduce auditing of AI and CA and posit that CAAI lies at their intersection. In Sect. 3, we present our materials and methods, providing an overview of the examined auditing frameworks and tools and our assessment criteria for CA. In Sect. 4, we assess the suitability of the frameworks and tools for CA. The paper ends with Sect. 5, which lays out the state of the art in CAAI, explores lessons from existing CA frameworks, and discusses limitations and future research directions.

2 Conceptual Background

2.1 Auditing of AI

The literature discusses auditing of AI under various terms. The early literature (Sandvig et al., 2014) and subsequent research (Brown et al., 2021; Galdon Clavell et al., 2020; Koshiyama et al., 2021) refer to “algorithm auditing” as a means to discover and mitigate discrimination and other problematic consequences of the use of algorithms. Interest in auditing algorithms has grown in conjunction with the increasing capabilities and power of inscrutable “black-box” algorithms that support decision-making and impact people and organizations (Pasquale, 2015).

The recent literature has introduced the concept of the ethics-based auditing (EBA) of automated decision-making systems (Mökander et al., 2021). EBA is defined as “a structured process whereby an entity’s present or past behaviour is assessed for consistency with relevant principles or norms” (Mökander et al., 2021, p. 1). This definition usefully leaves the audited entity open; thus, the targets of auditing may be algorithms, AI systems, or organizations. Brown et al. (2021, p. 2), in turn, defined ethical algorithm audits as “assessments of the algorithm’s negative impact on the rights and interests of stakeholders, with a corresponding identification of situations and/or features of the algorithm that give rise to these negative impacts”. The difference between these two definitions is that ethical algorithm audits focus on impact, whilst EBA highlights consistency with principles and norms. The definition of the ethical algorithm audit (Brown et al., 2021) also posits algorithms as the target of auditing rather than leaving the audited entity open.

For our conceptualization and assessment, we consider auditing of AI to encompass both principle- and impact-based approaches, preferring not to delimit the field prematurely. We acknowledge the existence of several types of AI auditing, such as auditing system performance. For example, according to the EU High-Level Expert Group, trustworthy AI consists of three components: AI should be lawful, ethical, and technically robust (High-Level Expert Group on Artificial Intelligence, 2019). These components are not fully independent of each other; for example, ethical concerns can lead to legal consequences, and a lack of technical robustness can lead to ethical concerns (Floridi et al., 2022). However, in the following discussion of CAAI, our primary focus is on the consideration of ethical issues and potential harm, such as matters of safety, in line with most of the current literature on auditing of AI (e.g. Falco et al., 2021; High-Level Expert Group on Artificial Intelligence, 2019; Mökander et al., 2021). A further argument in favour of an ethics focus is that companies have economic incentives to develop high-performing AI systems, but auditing to ensure safe and ethically responsible AI requires further research on tools and frameworks.

We use the term “auditing of AI” to highlight that our study focuses on auditing of AI rather than auditing using AI. There is a separate and growing stream of literature on the use of AI and other novel technologies to aid auditing (e.g. Kokina & Davenport, 2017). In contrast to this literature, we investigate the auditing of AI systems to discover and mitigate potential risks, harms, and breaches of standards. Whilst technical tools may play a significant role in auditing, in our study, AI is the target of auditing rather than the means.

2.2 Continuous Auditing

Traditionally, auditing procedures have been performed on a cyclical basis—for example, once a month—after business activities have occurred (Coderre, 2005). Breaking with this cyclical approach, CA was first introduced by Groomer and Murthy (1989), after which Vasarhelyi and Halper (1991) applied a monitoring layer for auditors (Shiue et al., 2021; Yoon et al., 2021). Whilst the concept of CA has existed since the 1980s and multiple definitions have been presented, no standard definition exists. The American Institute of Certified Public Accountants (AICPA, 1999) defined CA as “a methodology that enables independent auditors to provide written assurance on a subject matter using a series of auditors’ reports issued simultaneously with, or a short time after, the occurrence of events underlying the subject matter”. Focusing on the auditing component, the Institute of Internal Auditors defined internal auditing as follows:

an independent activity of objective assessment and of consulting designed to add value and improve operations of organizations while achieving their objectives through a systematic and disciplined approach in the evaluation of effectiveness of risk management, control and governance processes. (Institute of Internal Auditors, 2022).

Thus, auditing in general aims to serve organizations by evaluating risk management, controls, and governance, and CA introduces a further real-time component to it.

Compared to traditional auditing, CA features more frequent audits, a more proactive model, and automated procedures (Yoon et al., 2021). CA definitions include elements such as the processes of collecting and evaluating data, ensuring the real-time efficiency and effectiveness of systems, and performing controls and risk assessments automatically (Coderre, 2005; Marques & Santos, 2017). Two main activities emerge with CA: continuous control assessment and continuous risk assessment (Coderre, 2005). These activities focus on auditing systems as early as possible and highlight processes or systems that experience higher-than-expected levels of risk. In addition, CA changes the role of the auditor; the nature, timing, and extent of auditing; and the nature of audit reporting, data modelling, data analytics, and monitoring (Yoon et al., 2021). In particular, the role of internal auditors has changed, as they not only control audit activities but also monitor risk controls and identify areas in which risk management processes can be improved (Coderre, 2005). Eulerich and Kalinichenko (2018, p. 33) synthesized previous definitions and defined CA as follows:

a (nearly) real-time electronic support system that continuously and automatically audits clearly defined “audit objects” based on pre-determined criteria. CA identifies exceptions and/or deviations from a defined standard or benchmark, and reports them to the auditor. With this continuous approach, the audit occurs within the shortest possible time after the occurrence of an event.

CA brings many benefits. It reduces risks, diminishes fraud attempts, supports the objectives of internal control, allows timely access to information, integrates internal and external stakeholders and aids external auditing, allows timely adjustments, and changes auditors’ routine tasks, thereby allowing them to focus on more important responsibilities (Marques & Santos, 2017). Moreover, it increases confidence in transactions, operational processes, decision-making, and financial statements (Marques & Santos, 2017). Audit executives often prefer ongoing assessments to periodic reviews (Coderre, 2005). The next stage in audit development is CA that utilizes computer science technologies, and researchers have already proposed solutions for developing CA in organizational auditing (Wang et al., 2020).

2.3 Towards Continuous Auditing of Artificial Intelligence

Drawing on CA and auditing of AI, this study introduces the concept of CAAI, which is a type of auditing that exists at the intersection of CA and auditing of AI (Fig. 1). CAAI is CA that targets AI systems and corresponding organizations. In other words, CA provides the auditing methods, and auditing of AI provides the audit object. The intersectional position of CAAI means that it is a subset of both CA and auditing of AI. Not all CA targets AI systems; conversely, not all auditing of AI uses continuous approaches.

Fig. 1 Continuous auditing of artificial intelligence at the intersection of continuous auditing and auditing of artificial intelligence

The following is our working definition of CAAI: CAAI is a (nearly) real-time electronic support system for auditors that continuously and automatically audits an AI system to assess consistency with relevant norms and standards.

In line with a recent definition (Eulerich & Kalinichenko, 2018), we conceptualize CA as a (nearly) real-time electronic support system for auditors. Because CA definitions emphasize the automated nature of auditing, we decided to delimit the concept to the technical component. Nevertheless, CA operates in socio-technical systems together with human auditors. The AI system is posited as the audit target, which gives CAAI a clear focus and differentiates it from other types of auditing, such as financial auditing. The investigated AI system gives boundaries to CAAI, and eventually, organizations may complement it with broader auditing practices. Moreover, we draw on the EBA definition to highlight consistency with particular norms and standards (Mökander et al., 2021). These are defined by law, ethics, and societal norms, and they change over time. In the case of AI systems, the relevant norms and standards can also entail the examination of a system’s potential impacts. Compared to the EBA definition (Mökander et al., 2021), we omit “principles” because, in our view, for principles to be continuously audited, they need to be operationalized into norms and standards.
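To make the working definition concrete, the following minimal sketch (in Python) illustrates how a CAAI support system could operate: it periodically pulls a batch of recent decisions from a monitored AI system, evaluates them against predetermined criteria, and reports exceptions to the auditor rather than acting on its own. The data source, metrics, thresholds, and reporting channel are illustrative assumptions rather than elements of any specific framework discussed in this paper.

```python
"""Illustrative continuous-auditing loop for an AI system (a sketch, not a
reference implementation); all names, metrics, and thresholds are hypothetical."""
import random
import time


def fetch_recent_decisions(n: int = 200) -> list[dict]:
    """Stand-in for pulling the AI system's latest logged decisions."""
    return [
        {"correct": random.random() < 0.9,            # was the prediction correct?
         "group": random.choice(["A", "B"]),          # sensitive attribute
         "positive": random.random() < 0.5}           # positive decision given?
        for _ in range(n)
    ]


def accuracy(decisions: list[dict]) -> float:
    return sum(d["correct"] for d in decisions) / len(decisions)


def parity_gap(decisions: list[dict]) -> float:
    """Absolute difference in positive-decision rates between groups A and B."""
    rates = {}
    for g in ("A", "B"):
        group = [d for d in decisions if d["group"] == g]
        rates[g] = sum(d["positive"] for d in group) / max(len(group), 1)
    return abs(rates["A"] - rates["B"])


def audit_once() -> list[str]:
    """One audit cycle: evaluate predetermined criteria and collect exceptions."""
    decisions = fetch_recent_decisions()
    exceptions = []
    if accuracy(decisions) < 0.85:                    # predetermined performance criterion
        exceptions.append("accuracy below 0.85")
    if parity_gap(decisions) > 0.10:                  # predetermined fairness criterion
        exceptions.append("group parity gap above 0.10")
    return exceptions


if __name__ == "__main__":
    for _ in range(3):                                # in practice, runs continuously
        findings = audit_once()
        if findings:
            print("Exception report to auditor:", "; ".join(findings))
        else:
            print("No exceptions in this cycle.")
        time.sleep(1)                                 # near-real-time cadence
```

The sketch keeps the human auditor in the loop: the system only surfaces exceptions, and any interpretation or corrective action remains a human task.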

Like continuous auditing generally, CAAI markedly changes the temporality and tempo of auditing, whereby the audit of past or present events becomes the almost real-time monitoring of current events. Hence, the temporality of auditing comes closer to that of audited AI systems. Because CAAI requires continuous access to AI systems, it appears most relevant to internal audit functions within organizations (cf. Raji et al., 2020b; Tronto et al., 2021) as opposed to external auditing conducted by independent auditors. However, internal and external auditing roles may develop as the audit ecosystem evolves (Mökander et al., 2022). CA also changes the division of labour between humans and machines because the auditor can focus their attention on more interpretive and complex tasks rather than on processing data (Eulerich & Kalinichenko, 2018).

3 Materials and Methods

3.1 Overview of Studied Papers

In this study, we assessed AI auditing tools and frameworks vis-à-vis their suitability for CAAI. Table 1 presents the descriptive details of the papers included in this assessment. The papers were selected through targeted searches that combined auditing with AI or near-synonyms, such as “machine learning”, “deep learning”, “algorithm”, and “black box”. The goal was to summarize the most important AI auditing tools and frameworks and review their suitability for CAAI. We selected studies that addressed auditing of AI and developed either a tool or a framework. For each paper, Table 1 shows the author(s), publication year, the conference or journal in which the paper was published, and the tool or framework presented. The majority of the selected papers were published in conference proceedings, followed by journal articles and a few grey literature papers.

Table 1 Overview of the studied papers

The included papers were assessed for suitability vis-à-vis CAAI using the criteria introduced in the following section.

3.2 Assessment Criteria for Continuous Auditing

We derived six assessment criteria from the CA definitions in the literature (Table 2). As presented in Sect. 2.2, continuous AI auditing entails a continuous system in which processes are repeated regularly and automatically and in which data are collected and evaluated against predetermined criteria. Studies were given one point for each criterion met, and the total number of points was later used as a baseline when considering suitability for CA. Data collection, automation, and predetermined criteria were the most commonly met criteria, as they are also typical attributes of non-continuous AI auditing. Fulfilment of legal requirements and real-time operation followed in prevalence, whilst regularly repeated audit processes were the least frequently met criterion.

Table 2 Criteria derived from continuous auditing definitions

Table 3 shows all the studies and how they meet the assessment criteria. “Publication” refers to whether the study was a journal article or conference proceeding (Yes) or a grey literature paper (No). “Type” reflects whether the study develops a tool (T) or a framework (F). By “framework”, we mean a conceptual model that presents a set of components and interrelations. In turn, “tool” means a practically applicable tool or set of tools to audit some aspects of AI systems or organizations’ use of AI systems. Frameworks were somewhat more common than tools, as 22 studies developed a framework and 13 developed a tool.

Table 3 Papers assessed against the continuous auditing criteria

Suitability for CA was determined based on the fulfilment of or failure to fulfil the criteria. One point was awarded for the fulfilment of each criterion, after which the points were totalled in the suitability column. Therefore, the more criteria the paper met, the greater its suitability for continuous AI auditing.
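To illustrate the scoring procedure, the sketch below encodes each paper’s fulfilment of the six criteria as booleans, sums them into a suitability score, and assigns the clusters used in the Findings (high 5–6, medium 3–4, low or uncertain 0–2). The criterion labels paraphrase Table 2, and the example papers and their assessments are invented for illustration; they do not reproduce Table 3.

```python
"""Minimal sketch of the suitability scoring used in this study.
Example papers and assessments are hypothetical and do not reproduce Table 3."""
CRITERIA = [
    "real_time", "repeated_regularly", "automated",
    "data_collection", "predetermined_criteria", "legal_requirements",
]

# Hypothetical assessments: one point per fulfilled criterion.
papers = {
    "Example framework A": {"real_time": True, "repeated_regularly": True,
                            "automated": True, "data_collection": True,
                            "predetermined_criteria": True, "legal_requirements": False},
    "Example tool B": {"real_time": True, "repeated_regularly": False,
                       "automated": True, "data_collection": True,
                       "predetermined_criteria": False, "legal_requirements": False},
}


def suitability_score(assessment: dict) -> int:
    """Total points: one per criterion met (range 0-6)."""
    return sum(bool(assessment.get(c, False)) for c in CRITERIA)


def cluster(score: int) -> str:
    """Clusters used in the Findings: high (5-6), medium (3-4), low/uncertain (0-2)."""
    if score >= 5:
        return "high suitability"
    if score >= 3:
        return "medium suitability"
    return "low or uncertain suitability"


for name, assessment in papers.items():
    score = suitability_score(assessment)
    print(f"{name}: {score}/6 -> {cluster(score)}")
```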

4 Findings

The following sections provide an assessment of the auditing tools and frameworks, organized into three clusters: high suitability for CAAI (5–6 points on the criteria introduced in the previous section), medium suitability for CAAI (3–4 points), and low or uncertain suitability for CAAI (0–2 points). Under each cluster, we describe the frameworks and tools currently available in published sources.

4.1 High Suitability for Continuous Auditing of AI (5–6)

The papers that received five or six points in the assessment were ranked as high-suitability papers for CAAI. This means that these papers either dealt directly with CAAI or satisfied all the CA criteria, making the tools and frameworks they presented at least provisionally suitable for continuous AI auditing. Seven papers achieved high-suitability status, with six developing a new framework and one developing a tool for CA. A common characteristic of the high-suitability papers was that they aimed to define CA or clearly considered its criteria.

The focus of the developed frameworks varied. Lee et al. (2020) and three non-academic papers (Byrnes et al., 2018; ICO, 2020; PDPC, 2020) discussed the evolution of AI and sought to develop guidance for future AI auditing. The remaining high-suitability papers sought to solve specific problems. For example, D’Amour et al. (2020) provided an open-source software framework for studying the fairness of algorithms, and Pasquier et al. (2016) focused on cloud infrastructure, providing systems that continuously monitor information flows within cloud infrastructure and detect malicious activities and unauthorized changes. Amongst the high-suitability papers, the non-academic ones in particular focused on developing existing AI systems in ways that could be suitable for CAAI, considering current and future challenges and how to develop the field in the future.

Two tools were considered highly suitable for CAAI: FairLearn, developed by Bird et al. (2020), and capAI by Floridi et al. (2022). FairLearn is an open-source toolkit for improving the fairness of AI systems. It includes an interactive visualization dashboard and unfairness mitigation algorithms that manage trade-offs between fairness and model performance. The goal of FairLearn is to mitigate fairness-related harm, although fully guaranteeing fairness is challenging because societal and technical systems are highly complex. FairLearn recognizes a wide range of fairness-related harms and ways to improve fairness and detect unfair activities. For example, an AI system can unfairly allocate opportunities, resources, or information or fail to provide all people with the same quality of service. In addition, it can reinforce existing stereotypes, denigrate people, or overrepresent or underrepresent groups of people. FairLearn aims to address this gap with software, tackling these fairness issues continuously and focusing in particular on negative impacts.
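To indicate how a toolkit of this kind could feed a continuous audit cycle, the sketch below computes group-wise accuracy and selection rates and a demographic parity difference using the open-source fairlearn package’s MetricFrame interface (assuming fairlearn v0.7 or later and scikit-learn are installed). The data, the sensitive attribute, and the alert threshold are invented for illustration and are not drawn from Bird et al. (2020).

```python
"""Illustrative sketch: using a fairness toolkit inside a recurring audit check.
Assumes the open-source `fairlearn` package (MetricFrame API); data, sensitive
attribute, and threshold are hypothetical."""
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate, demographic_parity_difference

# Stand-ins for a batch of recent decisions pulled from the monitored system.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_pred = rng.integers(0, 2, size=500)
group = rng.choice(["A", "B"], size=500)      # hypothetical sensitive attribute

# Group-wise view of performance and selection rates.
frame = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)
print(frame.by_group)

# Single summary statistic that a CAAI loop could check on each cycle.
dp_gap = demographic_parity_difference(y_true, y_pred, sensitive_features=group)
if dp_gap > 0.10:                             # predetermined criterion (illustrative)
    print(f"Exception for auditor: demographic parity difference {dp_gap:.3f} > 0.10")
```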

The main purpose of capAI (Floridi et al., 2022), in turn, is to serve as a governance tool. It aims to ensure conformity with the EU’s Artificial Intelligence Act (AIA) by demonstrating that AI systems are developed and operated in a trustworthy manner. CapAI views AI systems across the entire AI life cycle, from design to retirement. It defines and reviews current practices and enables technology providers and users to conduct ethical assessments at each stage of the AI life cycle. The procedure consists of an internal review protocol, an external scorecard, and a summary data sheet. These allow organizations to conduct conformity assessments and to produce the technical documentation required by the AIA. They yield a high-level summary of the AI system’s purpose, functionality, and performance and summarize relevant information about the AI system’s purpose, values, data, and governance (Floridi et al., 2022).

Overall, the high-suitability papers dealt with developing automated CA systems that met each assessment criterion. Therefore, even if a paper did not explicitly develop a framework or tool for CAAI, it was considered suitable for this purpose. An automated and continuous system, with regularly repeated processes, was an essential aspect of most frameworks and tools. ICO (2020), PDPC (2020), and Byrnes et al. (2018), all of which touched on the future of AI, identified themes related to CA. Interestingly, only one tool was developed in the high-suitability papers. This could indicate that the discussion centres more on the general definition and direction of CAAI than on the development of new tools.

4.2 Medium Suitability for Continuous Auditing of AI (3–4)

Ten papers received three or four points in our assessment: six were frameworks, and four were tools. The criteria most typically met by the medium-suitability papers were real-time operation, data collection, automation, and predetermined criteria. “Repeat” was clearly the least commonly fulfilled criterion, followed by “legal requirements”. This indicates that medium-suitability papers may be suitable for continuous AI auditing even though, in principle, they were not designed for CA. However, as seven of the 10 medium-suitability papers met the real-time criterion, the difference between medium- and high-suitability papers is small in practice, and medium-suitability papers are also relevant to continuous AI auditing.

The tools and frameworks with medium suitability can be divided into ethics-based frameworks and technical approaches to specific problems. On the ethics-based side, Brown et al. (2021) presented an auditing framework to guide the ethical assessment of an algorithm. The non-academic papers in the medium-suitability category had many similarities with those in the high-suitability category. In Artificial Intelligence: Australia’s Ethics Framework, Dawson et al. (2019) covered civilian applications of AI with the goal of developing best-practice guidelines. Similarly, the Dutch information society platform ECP (2018) presented an AI impact assessment framework to provide guidelines for the rules of conduct of autonomous systems, and the WEF (2020) published a policy framework addressing responsible limits on facial recognition. All these frameworks cover ethical aspects of AI development, taking into account the characteristics of AI systems, but they give less attention to real-time capabilities and the repeated nature of procedures.

Amongst the technical approaches were, for instance, auditing frameworks focusing on black-box auditing and bias. The automation of activities was the focus of the fully automated black-box auditing framework by Drakonakis et al. (2020). The framework aims to detect authentication and authorization flaws in cookie handling that stem from the incorrect, incomplete, or non-existent deployment of appropriate security mechanisms. Sulaimon et al. (2019) and Thangavel et al. (2020) focused on security, bias, and data issues. Sulaimon et al. (2019) proposed a control loop that adapts the Monitor, Analyse, Plan, Execute, and Knowledge (MAPE-K) control loop for autonomous systems. Their goal was to ensure fairness in the decision-making processes of automated systems by adapting an existing bias detection mechanism. Thangavel et al. (2020), in turn, aimed to develop existing systems to increase and maintain cloud users’ trust in cloud service providers. They proposed a novel integrity verification framework that performs block-, file-, and replica-level auditing to verify data integrity and ensure data availability in the cloud.
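To give a flavour of what recurring block-level integrity auditing can look like in code, the sketch below stores reference hashes for fixed-size data blocks and re-verifies them on each audit cycle. This is a generic hash-based check of our own devising, not the verification scheme of Thangavel et al. (2020); the block size and data are illustrative.

```python
"""Generic block-level integrity check (illustrative only; not the
verification framework of Thangavel et al., 2020)."""
import hashlib

BLOCK_SIZE = 1024  # bytes per audited block (illustrative)


def block_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


def verify(data: bytes, reference: list[str]) -> list[int]:
    """Return indices of blocks whose hashes no longer match the reference."""
    current = block_hashes(data)
    return [i for i, (old, new) in enumerate(zip(reference, current)) if old != new]


# Baseline snapshot taken when the data is first stored.
original = b"training data " * 500
reference = block_hashes(original)

# Later audit cycle: the stored data has been tampered with.
tampered = bytearray(original)
tampered[2000] ^= 0xFF
print("Blocks failing integrity check:", verify(bytes(tampered), reference))
```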

Ethical considerations played an essential role in the tools presented in the medium-suitability papers. AI Fairness 360 by Bellamy et al. (2019) and FlipTest by Black et al. (2020) focus on fairness issues in AI systems. Their main objective is to help users apply fairness algorithms and progress as seamlessly as possible from raw data to a fair model. Greater fairness, accountability, and transparency in algorithmic systems were also the objectives of the Algorithmic Equity Toolkit by Katell et al. (2020) and the Counterfactual Explanations for Robustness, Transparency, Interpretability, and Fairness of Artificial Intelligence models (CERTIFAI; Sharma et al., 2019). Zicari et al. (2021) assessed AI trustworthiness by developing the Z-Inspection process, which assesses and seeks to resolve ethical issues and tensions in AI usage domains.

In summary, the medium-suitability papers offer important guidelines and tools for continuous AI auditing. CA was not their core focus, but they show clear similarities and applicability to continuous AI auditing. In particular, continuous and real-time operation was a point of interest in the medium-suitability papers. However, systems that operate repeatedly and automatically did not stand out as strongly as they did in the papers that received five or six suitability points. Additionally, the medium-suitability papers did not recognize or define the concept of CA as clearly as the high-suitability papers did.

4.3 Low or Uncertain Suitability for Continuous Auditing of AI (0–2)

Papers that received 0–2 points in the assessment were ranked as low- or uncertain-suitability papers. Eighteen papers were considered to have low or uncertain suitability, which was clearly the largest category amongst the studied papers. The low suitability scores mean that the CA criteria were neither mentioned nor specified in these papers; in particular, the real-time and repeat criteria were not met. The most commonly fulfilled criteria in this category were “predetermined criteria” and “legal requirement”, followed by “data collection”. Owing to their low suitability for CA, we do not discuss these papers in detail. However, it should be noted that some of the frameworks and tools could nevertheless be adapted to suit CA. For example, formulating guidelines for ethical AI auditing, putting principles into practice, and designing tools for specific issues might bring significant insight into the continuous AI auditing discussion, even though the framework or tool itself is not intended for CA.

5 Discussion and Conclusion

We draw out the central implications of our conceptualization and assessment of CAAI frameworks and tools in the following sections. First, we lay out the state of the art in CAAI. Then, we point to lessons from existing CA frameworks from fields other than AI. Finally, we discuss the central problem of automation and human oversight, and we conclude the paper with limitations and future research directions.

5.1 The State of the Art in Continuous Auditing of AI

CAAI is an emerging field, and we are only beginning to draw its contours. Whilst we were able to find literature on auditing of AI and on CA, we found none that explicitly connected the two topics as its core focus. At the same time, based on our overview, there is significant potential for continuous approaches to the auditing of AI. In the following paragraphs, we present provisional rather than definitive conclusions because the area is moving quickly and many frameworks and tools may have untapped potential.

To sum up the findings of the previous section, no clear pattern emerges from the high-suitability audit tools and frameworks. They are a highly heterogeneous mix of software frameworks, risk management frameworks, and other auditing tools. Considering the criteria for judging CA, the automated, real-time, and repeated nature of auditing appear to be the criteria essential to the continuous nature of AI auditing. The remaining criteria (data collection, predetermined criteria, and legal requirements) condition the specific type of auditing and are also part of traditional, non-continuous auditing.

Given the early stage of the conceptualization of CAAI, it is useful to consider the basic distinctions between potential CAAI tools. One clear difference is between sector-specific tools (e.g. healthcare) and cross-sectoral tools. There is a trade-off between sector-specific tools that can focus on sectorally relevant issues (e.g. privacy issues in healthcare) and general tools that are more abstract and may either leave out sectorally important AI governance issues or include irrelevant issues. Another crucial axis is the desired level of automation in the overall auditing process. With a comparatively low level of automation, CAAI can assist auditors and provide additional information on the fairness of algorithms, for instance. If the desired level of automation is high, the auditing process can be automated to a large extent. Then, the human auditor has a more limited role akin to the “human-on-the-loop” model, whereby automated systems can make decisions, but a human oversees them and intervenes in the case of incorrect decisions (Benjamins, 2021).
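A minimal way to picture the automation axis is an escalation rule: the automated audit handles routine findings on its own and routes only exceptional cases to a human overseer, in line with the human-on-the-loop model. The sketch below is purely illustrative; the severity scoring and thresholds are our own assumptions rather than elements of any framework cited above.

```python
"""Illustrative human-on-the-loop escalation rule for an automated audit step.
Severity scoring and thresholds are hypothetical."""


def handle_finding(severity: float, auto_fix_available: bool) -> str:
    """Decide whether the system acts on its own or escalates to a human auditor."""
    if severity < 0.3:
        return "log only"                       # fully automated handling
    if severity < 0.7 and auto_fix_available:
        return "apply automated mitigation"     # machine acts, human can review later
    return "escalate to human auditor"          # human-on-the-loop intervention


for severity, fix in [(0.1, True), (0.5, True), (0.5, False), (0.9, True)]:
    print(f"severity={severity}, auto_fix={fix} -> {handle_finding(severity, fix)}")
```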

Our study made the distinction between frameworks and tools, which may be difficult to discern in practice. Going forward, we hypothesize that both general frameworks and specialized tools are needed in CAAI. Practical tools are likely to be most valuable when used as part of a more general auditing framework that contextualizes the tools and the information they provide. CAAI could thus be seen as a nested system with an overarching framework and a set of specific tools under this framework.

Considering recent developments in AI regulation, one strong candidate for an overarching general framework is the proposed EU AI Act (European Commission, 2021). The AI Act proposal includes provisions for the mandatory post-market monitoring of high-risk AI systems, which requires AI system providers to draft post-market monitoring plans to document the performance of high-risk AI systems throughout their life cycles (Mökander et al., 2022). At present, however, the AI Act leaves the practical implementation of post-market monitoring largely open (Mökander et al., 2022). This is where CAAI frameworks and tools could contribute by concretizing the generic AI Act and providing practical tools for AI developer organizations. The EU AI Act may thus increase the demand for CAAI tools. At the same time, CAAI tools can also support ethics-based AI auditing in areas not covered by the EU AI Act, including lower-risk systems and broader ethical and impact issues. In other words, CAAI tools could supplement legally binding requirements and support corporate social responsibility and business ethics.

To take the CAAI field forward, it is useful to look at the possible types of CAAI differentiated by their maturity. Here, we draw on Gregor and Hevner’s (2013) distinction between two axes—solution maturity (low/high) and application domain maturity (low/high)—which were developed to enable the understanding of different types of design science contributions. These axes yield four possible types of CAAI frameworks and tools: improvement (new solutions for known problems), invention (new solutions for new problems), routine design (known solutions for known problems), and exaptation (extending known solutions to new problems). Table 4 lays out these different types.

Table 4 Solution maturity and application domain maturity in continuous auditing of artificial intelligence (based on Gregor & Hevner, 2013)

Based on our assessment of frameworks and tools, the CAAI field is not yet in the stage of routine design because CAAI solutions are still emerging. According to a review of the design science literature, invention, the generation of new solutions for new problems, is rare in practice (Gregor & Hevner, 2013). However, there is significant scope for improvement and exaptation in the CAAI field. The present study has focused largely on improvement—that is, the development of new CAAI solutions for known problems. This means that the problems—regarding AI ethics and trustworthiness, for example—are known, but new tools are needed to improve the maturity of the solutions. The final category, exaptation, means that existing solutions would be extended to new problem areas. In this case, it means applying CA frameworks from other fields to auditing of AI. This is a promising direction that complements our study of existing AI auditing solutions by looking at CA solutions and asking what could be learned regarding the auditing of AI. We turn to this question in the following section.

5.2 Drawing Lessons for AI Auditing From the Existing Continuous Auditing Frameworks

This study assessed frameworks and tools intended for auditing of AI in light of their suitability for CA. A logical next step is to approach the issue from the opposite direction and draw lessons from existing frameworks developed for the CA of entities other than AI systems. How can aspects of CA frameworks be adapted to audit AI systems?

The CA literature presents numerous CA frameworks intended for financial and IT auditing. For example, Yoon et al. (2021) and Shiue et al. (2021) developed frameworks for CA systems. Yoon et al. (2021) presented a CA system with alarms for unusual transactions and exceptions on three levels. Shiue et al. (2021) explored key criteria for implementing CA systems based on two approaches: an embedded audit module and a monitoring and control layer. Going further, Majdalawieh et al. (2012) designed a full-power CA model that supports business process-centric auditing and business monitoring whilst enabling the fulfilment of compliance requirements within internal and external policies and regulations. Their model has three objectives: to build a CA model on the principle of continuous monitoring and with predefined components, to facilitate the integration of CA and business information processing within an enterprise using different building blocks, and to give practitioners insight into the state of the adoption of CA in the enterprise and how it will enhance their audit effectiveness and efficiency. Tronto and Killingsworth (2021) also focused on developing a continuous monitoring tool for collaboration between internal auditing and business operations. Kiesow et al. (2014) recognized the problems with the implementation of CA and noted that traditional audit tools neglect the potential of Big Data analytics; therefore, they strove to develop a computer-assisted audit solution. Wang et al. (2020) proposed a continuous compliance awareness framework to audit an organization’s purchases in an automatic and timely manner. Eulerich et al. (2022) developed a three-step evaluation framework to facilitate robotic process automation and assist auditors in deciding which activities should be automated.

Common to these existing CA solutions is that they are organized around business and accounting processes, such as purchase orders and invoices. In contrast, auditing of AI focuses on auditing an AI system’s consistency with relevant norms and standards. Owing to this difference in focus, the existing CA frameworks cannot be directly adopted as CAAI frameworks; instead, they need to be adapted to serve as CAAI solutions. The further development of their adaptation is beyond the scope of this paper, but we offer three points that are relevant to this future work:

  • The scope of CAAI needs to be specified, particularly whether CAAI should focus on a specific algorithmic system or if it extends more broadly to auditing organizations’ use of AI systems in the future.

  • The human–machine division of labour needs to be considered to define which aspects of AI auditing should be automated and which should not.

  • Emerging CAAI systems must be considered in light of the emerging actor landscape and institutions of AI governance, such as a possible AI regulatory body in the European Union (Stix, forthcoming).

5.3 Automation and Human Oversight

Whilst the automation of auditing promises efficiency, there is a risk of introducing a second-order problem. If opaque and unpredictable automated systems are the original problem, can automated auditing also become opaque and unpredictable? The assurance of AI systems could thus lead to a kind of infinite regress: the systems that audit AI systems need to be audited, the systems that audit the auditing systems need to be audited, and so on. As an organizational response to this problem, the established “three lines of defence” model, which includes operational management, risk management functions, and internal audits, could be adapted to manage AI risks (Institute of Internal Auditors, 2020; cf. Financial Services Agency of Japan, 2021). Addressing a similar problem, Metcalf et al. (2021) wrote about “the assessor’s regress” in the context of impact assessments, whereby the completeness of an assessment relies on a never-ending chain of justification. Their answer to this dilemma is that a forum and a legitimate accountability relationship must exist to close the regress. In the CAAI context, mechanisms for creating trust in the auditing system are needed. However, exploring more details about such mechanisms and the connections to the three lines of defence model is beyond the scope of this paper.

On a broader societal level, CA raises a challenge regarding the widely accepted notions of the human oversight of AI systems (Floridi et al., 2018; Shneiderman, 2020). Ensuring human oversight, human-centricity, and agency over opaque AI systems is one of the central principles of AI ethics (Dignum, 2020; High-Level Expert Group on Artificial Intelligence, 2019). Against this background, CAAI can be seen to diminish human control and understandability in the auditing process because part of auditing work is transferred to machines.

However, there is another possible reading of CAAI from a human-centric AI perspective. It can be argued that outsourcing part of the mechanical auditing work to machines frees human auditors to focus on higher-level auditing and oversight tasks. If CAAI is designed in a human-centric manner, it can augment rather than diminish human capabilities. Transferring oversight tasks from humans to machines can paradoxically increase human oversight of AI systems, but this requires an appropriate CAAI design.

The general conclusion from this discussion is that CAAI systems should be kept relatively simple and transparent to avoid adding layers of opaqueness and complexity to already complex systems. In this case, the assurance of CA is more straightforward than the assurance of complex algorithmic systems. Another potential solution to the second-order assurance problem is the standardization and issuing of certifications for CA products to create trust in CA. At least initially, we can assume that the assurance of CA processes is more straightforward than the assurance of complex AI systems and the assessment of their societal impacts.

5.4 Limitations and Future Research Directions

As a foray into a novel topic, this study has some limitations. It is still too early to conduct a systematic literature review specifically on CAAI; hence, our assessment’s coverage of relevant publications, frameworks, and tools may be incomplete. However, this is a challenge with any discussion on a fast-moving topic, such as AI auditing, in which technologies and legislation continuously co-evolve. Moreover, our study does not cover the technical aspects and processual details of CAAI. In other words, we do not delve into the complexities of gaining visibility into black-box systems. Further technical and organizational research is likely needed for CAAI to be practically feasible.

Owing to its exploratory nature, this study suggests significant areas for future research. As CAAI frameworks and tools mature, a systematic literature review will become a helpful tool for gaining a bird’s-eye view of the developing field. In addition, studies could drill down into sectoral requirements and actor dynamics in particular industries, such as healthcare, public administration, and finance. The interplay between sectoral legislation, generic AI legislation, and ethical and stakeholder requirements provides rich avenues for case studies; comparative studies; and, eventually, quantitative studies on a larger scale.