1 Introduction

A study by McKinsey estimates that, by 2030, artificial intelligence (AI) could displace 15% of the global workforce or 400 million workers and hit the accounting profession particularly hard (Manyika and Sneader 2018). Indeed, the release of ChatGPT, a large language model developed by OpenAI and one of the fastest-growing technologies in history (e.g., Reuters 2023), has revived a discussion of how AI and automation will change the accounting profession. When ChatGPT was released in November of 2022, news stories discussed how it would disrupt the business world. For example, Eloundou et al. (2023) report “that around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of [large language models], while approximately 19% of workers may see at least 50% of their tasks impacted.” They list, among others, accountants, auditors, and tax preparers as having a 100% exposure to significant automation. Concurrently, large accounting firms have announced the use of large language model-based AI systems: PwC and KPMG will spend $1 billion and $2 billion, respectively, while EY has already spent $1.5 billion (PwC 2023a; The Wall Street Journal 2023a, b).

These are not the first predictions about how technology will revolutionize the accounting profession. Other technology “revolutions” have included the introduction of the computer; software improvements like spreadsheets, databases, and ERP systems; continuous auditing; distributed ledger technology (blockchain); and automation tools, like robotic process automation (RPA). While none of these proved to be the “end of accounting” (or even an immediate drastic change in accounting), the question of whether this time will differ is a tantalizing topic that continues to attract attention and drives panels, press, and presentations.

We discuss several technologies that proved to be overhyped, as a caution to interpreting the effect of ChatGPT on accounting. We then contribute to the question of whether this time is different by examining the performance of a large language model on accounting content. Specifically, we test how well ChatGPT performs on the Certified Public Accountant (CPA), Certified Management Accountant (CMA), Certified Internal Auditor (CIA), and Enrolled Agent (EA) certification exams.

An initial study by Wood et al. (2023) suggests that the ChatGPT hype is not likely to result in massive disruption. The authors find that ChatGPT 3.5 vastly underperformed accounting students, as it could only score around 50% on accounting exams, compared to students who scored higher than 75% on the same exams. A related study conducted by accounting journalist Steve Gaetano in 2023 shows that Chat-GPT 3.5 performed poorly on accounting certification exams—with scores ranging from 35 to 48% on sections of the CPA exam.

Although the initial testing suggests that large language models struggle to answer accounting assessment questions, additional testing is necessary because the models are improving rapidly. OpenAI released ChatGPT 4 in March 2023 with statistics showing a significant improvement over ChatGPT 3.5. For instance, the 3.5 model scored in the 10th percentile on the bar exam, but ChatGPT 4 scored in the 90th percentile (OpenAI 2023). We test how much the new model and additional refinements to the ChatGPT model improve performance. We perform the following tests and document how much each successive test improves performance:

  1. 1.

    Use ChatGPT 3.5 to establish a baseline and compare it to prior research.

  2. 2.

    Examine how much using ChatGPT 4 improves performance.

  3. 3.

    Train ChatGPT 4 using few-shot training and measure performance.Footnote 1

  4. 4.

    Turn ChatGPT 4 into an agent with reasoning and acting abilities (ReAct) and measure performance.Footnote 2

We take a random sample of 150 to 300 questions for each part of each exam for these tests.Footnote 3

Our results show that the original ChatGPT 3.5 model performs like the Wood et al. (2023) tests using accounting assessment questions. The average across all parts of exams was 53.1%, compared to Wood et al.’s (2023) overall average of 55%. Using the new ChatGPT 4 model markedly improves scores by an average of 16.5%. Providing few-shot training further improves scores by an additional 6.6%, and allowing ChatGPT to react and reason improves scores by an additional 8.9%. The results are that ChatGPT 4, with few-shot training and the ability to react and reason, results in an overall average score of 85.1% across all content tested. The updated ChatGPT performance is sufficient to easily pass all sections of the multiple-choice questions to be a CPA, CMA, CIA, and EA. This is far better than the results reported by Geatano (2023) for the CPA exam, which showed an average performance of 42% across the four sections of the exam.

To our knowledge, these are the first large-scale results that AI performs as well as or better than many accounting professionals. While this does not definitively suggest that this time will differ, the results suggest that it may. To further the discussion of whether ChatGPT is overhyped and the potentially large effects it could have, we discuss examples of how it is currently changing accounting (including accounting academia). We note that overhyped technologies are usually spoken of in what they will do while technologies that are not overhyped are spoken of in what they are already doing. We show how ChatGPT is already being used in significant and meaningful ways in accounting.

While we want to do not want to overhype ChatGPT, our empirical and anecdotal evidence provide compelling evidence that it and similar technologies will significantly, maybe even dramatically, affect accounting and accounting education. So maybe the right question is not whether we are overhyping this technology but rather whether we are hyping it enough.

2 Technological changes in accounting

Technological changes have occurred throughout the history of accounting and have profoundly impacted the profession. Indeed, double-entry bookkeeping was a major technological innovation that spurred great changes in business and society (e.g., Williams 1978). More recently, research shows that greater use of technology is associated with many benefits in accounting (e.g., Cardinaels et al. 2019; Chen and Srinivasan 2023; Eulerich et al. 2023a, b, c; Rozario and Zhang 2023). Yet the effects of technology are usually incremental; achieving them takes time. This pattern of behavior is consistent with the Gartner Hype cycle (Fenn and Raskino 2008), which is a visual model that shows the stages of hype and expectations surrounding new technologies. (See Fig. 1 for the typical shape of the Gartner Hype cycle).Footnote 4 It highlights the initial excitement, followed by a period of disillusionment, and ultimately the practical applications and benefits that emerge as the technology matures. The media and some experts often predict that a new technology will have a drastic, dramatic impact, without fully considering how long the process will take. We discuss several examples of this phenomenon to better evaluate the effects of Generative AI solutions, like ChatGPT, on the accounting and auditing profession.

Fig. 1
figure 1

Depiction of the Gartner Hype Cycle. The hype cycle and its stage indicators (adapted from Fenn and Raskino 2008)

2.1 Distributed ledger technology

Distributed ledger technology, including its most notable iteration, blockchain, has been frequently presented as a revolutionary tool poised to reshape the way we approach transactions and data management (e.g., Dai and Vasarhelyi 2017). Central to this narrative is the assertion that blockchain technology, with its capacity for enhanced security and transparent record-keeping, would bring groundbreaking changes to numerous sectors, accounting being one (e.g., Dai and Vasarhelyi 2017; Kokina et al. 2017). As an example, Dai and Vasarhelyi (2017) argue from an academic point of view that “blockchain is one of the most important and innovative technologies developed in recent years. … Accounting and assurance could be among the professions to which blockchain would bring great benefits and fundamentally change the current paradigms” (p. 5). The media and professionals echoed this hype. For instance, Casey and Vigna (2018) write in their opinion article “Blockchain will make today’s accountants (and many Wall Street jobs) obsolete” that “once account-keeping itself becomes fully automated and reconciliation functions become superfluous, both those who keep the books and those who audit them will be out of work.” Morehouse (2017) extended this view, arguing that “transactions that are recorded in real time and can’t be altered can be audited daily, eliminating the need for the expensive audits public companies are required to have every quarter.”

However, blockchain technology appears to have been overhyped. While research shows there are large investments in this technology and potential use cases for accountants and auditors (e.g., Kokina et al. 2017), others show that a full transfer to blockchains is infeasible (e.g., Coyne and McMickle 2017). To date, the adoption of blockchain and its effect on accounting can be categorized, at best, as modest.

Nevertheless, it is important to recognize that blockchain has indeed made tangible contributions to several sectors, with accounting and auditing being among them. Blockchain technology offers an array of possible benefits in these sectors, such as enhanced transparency, data immutability, smart controls, and transactional security as well as close-to-real time audits based on the increased auditability (e.g., Dai and Vasarhelyi 2017; Nordgren et al. 2019; Kwilinski 2019).

While these are indeed valuable improvements, they fall short of the game-changing transformation that was anticipated. In terms of future integration, Macaulay (2022) predicts more modestly that blockchain will become a component of SAP cloud services over the next five years. However, these types of ERP integrations are expected to provide only incremental improvements. Similarly, Oracle’s Blockchain Tables, which integrate blockchain technology into the Oracle Database, offer a good example of how blockchain can be used to enhance existing systems rather than revolutionize them. These tables improve security and data integrity and offer various practical advantages, such as facilitating auditing and securely storing compliance data (Rakhmilevich 2019). Finally, the Canton Network—a blockchain system that includes significant participants like Deloitte, Goldman Sachs, and Microsoft—offers another example of how blockchain can be integrated within existing regulatory frameworks to provide incremental improvements (Weiss 2023).

Taken together, research and practical application show that blockchain has a role to play in accounting but the initial hype has yet to come to pass. While blockchain does offer tangible benefits and improvements, the scale of its impact has not lived up to the original rhetoric.

2.2 Automation software including robotic process automation

Robotic process automation (RPA) software is the use of low-code or no-code software to automate repetitive, routine business processes (Cooper et al. 2019). It is a type of technology that uses software robots, called bots, to automate repetitive and rule-based tasks within computer systems. RPA software is designed to mimic human interactions with user interfaces and perform tasks, such as data entry, data manipulation, form filling, and more. It can work across various applications and systems, interacting with them just as a human user would.

Initial research showed that RPA had very impressive results: “One accounting firm shared that in 2017 they saved over one million human work hours from RPA, while another respondent discussed turning a task that took 16 hours to complete into a 17-second task. Firms also report seeing increased quality as bot accuracy approaches 99.9%, compared to human performance on the same task that is often closer to 90%” (Cooper et al. 2019; p. 16). Many press articles echoed these initial findings about the potential of RPA:

  • Headline: “RPA: the Most Important Megatrend You’re Not Hearing About” and relevant quote: “Practically, every profession in the world involves repetitive tasks. And in almost every case, a computer would do a much better job of carrying out these tasks. The possibilities of RPA are truly endless.”Footnote 5

  • Headline: “The Future of Accounting: How RPA And AI Will Revolutionize the Industry” and relevant quote: “AI, RPA, and other automated tech are transforming accounting, bringing increased precision, efficiency, safety, cost-savings and visibility.”Footnote 6

  • Headline: “The Impact of Robotic Process Automation in Accounting” and relevant quote: “New technologies are growing able to mimic human activity, taking on repetitive work more rapidly and accurately than people can. The authors offer an overview of Robotic Process Automation (RPA) in accounting that will change the ways the profession operates.”Footnote 7

  • Headline: “RPA: A Building Block of Transformative Automation” and relevant quote: “Companies [are] using [RPA] to revolutionize their workforces and accelerate more advanced automation efforts.”Footnote 8

Research does provide evidence of the benefits of RPA, such as greater efficiency, effectiveness, and auditor satisfaction (Kokina and Blanchette 2019; Cooper et al. 2022; Coyne et al. 2023a; Coyne et al. 2023a, b). However, research and practice have started to discuss the limitations of RPA, including failure rates sometimes as high as 50% (EY 2020; Moffitt et al. 2018), significant internal control and governance problems (Bakarich and O’Brien 2021; Eulerich et al. 2022, 2023a, c), and a short-term focus that hurts long-term success (Zhang et al. 2023a, b, c).

Once again, RPA is a revolutionary technology that did not live up to the hype. Accountants use it, but it has not dramatically changed the profession. It has found an important role in organizations and helps in the right circumstances improve efficiency and effectiveness.

2.3 Other technologies

While blockchain and RPA are relatively dramatic examples of the hype cycle, other technologies in accounting provide similar, albeit less dramatic examples. There is an ongoing discussion about other technologies with strong disruption potential, like the use of drones for auditing (e.g., Appelbaum and Nehmer 2017a, b; Christ et al. 2021), the implementation of process mining for auditing (e.g., Jans et al. 2014; Jans and Eulerich 2022), or using virtual avatars for interviewing in auditing (Pickard et al. 2016, 2020; CTStrategies 2018). Each of these technologies was held up as having the ability to disrupt accounting and auditing, but the changes they have made are more incremental than revolutionary.

John Williams, the head of the Association of Chartered Certified Accountants (ACCA), said it well: “The situation [of technology replacing accountants] isn’t anything new; if you take a look back to 25 years, [someone] may have predicted the end of accountants with the advent of software like SAP or and Oracle, but at this point, it’s quite clear that accountancy is the profession that managed to survive and thrive.”Footnote 9 This same sentiment could be applied to most other technologies that have been introduced with an intent to vastly disrupt the accounting industry.

2.4 Generative AI and ChatGPT

Based on the previous discussion, one might say that ChatGPT is overhyped and unlikely to prove a large disruptor of accounting and that eventually expectations will temper, as predicted by the Gartner Hype cycle. While this is plausible, it is often hard to know where a specific technology resides on the hype cycle diagram. This is particularly the case when technology can belong to several categories. For example, ChatGPT is based on a large language model, which is a type of generative AI, as a sub-group of AI. Thus it is unclear whether ChatGPT, language models, or (generative) AI is what is being mapped on a hype cycle. The former is relatively new and placed near the peak of inflated expectations as a new technology, while AI has existed for decades and is much more likely to be on the plateau of productivity.

AI has been studied in accounting for several decades, mainly looking at anomaly detection or decision-support through classification; however, recently, it has started to have a much greater impact on accounting practice. For example, research shows that AI improves management forecast accuracy, timeliness of earnings announcements, and precision in earnings forecasts (Rozario and Zhang 2023); increases firm value and performance (Chen and Srinivasan 2023); causes managers to be less aggressive (Estep et al. 2023); and improves internal and external audit quality (Christ et al. 2021; Fedyk et al. 2022; Emett et al. 2023a; Eulerich and Wood 2023). Other studies focus on the potential improvements in efficiency and effectiveness when using AI within a company (e.g., Jain et al. 2021; Choudhury et al. 2020; Tong et al. 2021). Most of the benefits described could be directly transferred to the accounting profession.

While this research shows positives to the use of technology and AI in accounting, it may come at the cost of accountants’ jobs. Fedyk et al. (2022) show pre-ChatGPT AI reduces the number of accounting employees, but the time it takes to reduce headcount and the number of employees displaced are relatively modest in size. They found that a one-standard-deviation increase in AI investments is associated with a reduction in accounting employees that reaches 3.6% after three years and 7.1% after four years. While this is meaningful, most would not consider it revolutionary.

Interestingly, all these studies were released before the widespread release of large language models like ChatGPT and Alphabet’s Gemini. Do these language models differ? Preliminary research suggests that large language models may have a larger impact than previous AI releases. Rather than focusing on studies that predict what will happen, we focus on discussing the few empirical studies that test the effects of ChatGPT on employee productivity and related topics.

Kreitmeir and Raschky (2023) studied what happened to Italian and other European professional coders’ individual productivity when Italy banned ChatGPT. Using a difference-in-difference design they showed that programmer productivity dropped 50% in the first two business days after the ban but then recovered, at least partially because of a swift implementation of the use of censorship bypassing tools (e.g., VPNs and using the TOR network).

Dell’Acqua et al. (2023) use an experiment to study the effects of ChatGPT 4 access on consultants at Boston Consulting Group. They showed dramatic improvements for tasks that AI can perform—consultants performed tasks 25.1% more quickly and completed 12.2% more tasks with quality 40% higher than a control group. Gains were most impressive for historically below average performers, who improved performance 43%, compared to above average performers increasing performance 17%. However, for tasks that are outside AI’s current abilities, consultants using AI were 19% less likely to produce correct solutions. Thus generative AI proved to be highly effective for many tasks but could harm performance for tasks that are outside of AI’s current abilities.Footnote 10

Other studies do not quantify the effects of ChatGPT but do show there is no qualitative difference in the creativity of humans and AI, with only 9.4% of humans being more creative than the most creative AI tested (Haase and Hanel 2023); Overall large language models, especially ChatGPT, have led to an increase in the quality, novelty, and creativity of content generated by humans (Sanatizadeh et al. 2023; Zhou and Lee 2023); ChatGPT performs better than previous tools at automatic bug fixing in computer programming (Sobania et al. 2023) and can significantly outperform sentiment analysis methods for forecasting stock market returns (Lopez-Lira and Tang 2023).

On the other hand, not using ChatGPT can dramatically impact the current and future success of companies. Bertomeu et al. (2023) show that the ban of ChatGPT in Italy had a negative effect on the capital market and the valuation of Italian companies. Finally, Eisfeldt et al. (2023) create portfolios of companies that have high and low exposure to generative AI like ChatGPT and find that higher-exposure firms earned excess returns that are 0.4% higher on a daily basis (which equates to over 100% on an annualized basis) than lower-exposure firms—suggesting that, “according to investors, ChatGPT represents an important shock to corporate valuations.”

The effects sizes in these studies are quite large, which suggests that ChatGPT could be more disruptive than previous technologies. Whether these new large language models will dramatically affect accounting will be determined in time. However, one necessary ability of ChatGPT to be transformative is that it must have sufficient knowledge of accounting content to perform accounting tasks. If ChatGPT is not good at being an accountant or auditor, then the predictions are unlikely to be realized. The evidence from Wood et al. (2023), discussed in the introduction, provides initial empirical evidence that ChatGPT is not capable of significantly reducing the need for professional accountants.

Although the initial evidence of ChatGPT in accounting was poor, additional models have been released and the community has gained additional experience and expertise in how to work with these models. As such, we empirically test ChatGPT’s current ability in relation to accounting content.

3 Methodology

We compare the performance of ChatGPT 3.5 and 4 models on questions from accounting licensure examinations.Footnote 11 We gather questions from four different licensure exams that are meant to cover the main areas of accounting, including financial topics (on several of the exams), internal and external auditing (on two exams), management accounting (on one exam), and tax accounting (on two of the exams).Footnote 12

  1. 1.

    CPA exam: we use questions from Becker CPA exam preparation guides. We only include questions from the main course in our analyses. The CPA exam has four parts:

    1. a.

      Auditing and attestation (AUD).

    2. b.

      Business environment and concepts (BEC).

    3. c.

      Financial accounting and reporting (FAR).

    4. d.

      Regulation (REG).

  2. 2.

    CIA exam: we use questions from the global Institute of Internal Auditors (IIA) multiple choice training system. This exam is translated into various languages. We use questions translated into German. The global CIA certification had three parts:

    1. a.

      Part 1: Essentials of Internal Auditing

    2. b.

      Part 2: Practice of Internal Auditing

    3. c.

      Part 3: Business Knowledge for Internal Auditing

  1. 3.

    CMA exam: We use questions from Becker CMA exam preparation guides. The CMA has two parts:

    1. a.

      Part 1: Financial Planning, Performance, and Analytics

    2. b.

      Part 2: Strategic Financial Management

  2. 4.

    EA exam: We use questions from Gleim exam preparation (posted online), the enrolledagent.com exam prep website, and the IRS exam preparation website. The sections of the EA exam include:

    1. a.

      Part 1: Individuals (IND)

    2. b.

      Part 2: Businesses (BUS)

    3. c.

      Part 3: Representation, Practices, and Procedures (RPP)

For all exams, we only keep questions that do not have images in their text. We also only include multiple choice questions and not workout type questions. To the extent workout type questions resemble university case studies, research suggests that ChatGPT 4 can perform reasonably well on most of these types of assessments (Chen et al. 2023). However, to expedite testing, given the fast-changing nature of this technology, we omit testing of these types of assessments and the timely grading necessary to evaluate them.

We tested the differences between the 3.5 and 4 models. We also perform additional tests to see whether we can boost the performance of the ChatGPT 4 model. Specifically, we also provide few-shot training. Few-shot training is a method in which the model is provided a few examples before submitting questions for testing (Wang et al. 2020). Few-shot training usually ranges from submitting two to five examples, but it can also use up to 100 examples (Wang et al. 2021). To compensate for the limited number of training examples, models in a few-shot context would require some prior information (e.g., a pre-trained language model). GPT 3.5 and GPT 4 are both pre-trained models.

For our few-shot training, we randomly sampled 10 questions and used these to train ChatGPT. Submitting questions is called “prompting” the AI. We follow OpenAI’s (2023) guidelines to engineer our prompt. When prompting through the OpenAI API, we can also set the level of creativity of the model using the TEMPERATURE hyperparameter. By setting the temperature to zero, we eliminate randomness in models’ responses and reduce creativity. As we are measuring demonstrably correct answers, creativity in responses was not desirable. In practice, the model should provide the same response every time we prompt the same question with the temperature set to zero.

Finally, we advance our model through reasoning and acting. To this end, we follow Yao et al. (2023) and Schick et al. (2023) and introduce agents to ChatGPT 4. Agents can be thought of as enabling tools. Agents allow a large language model to accomplish the tasks that a human would do, such as using a calculator for math or using search engines for information gathering. Using agents, a large language model can also write and run Python programming or even query an SQL database. In some testing, we allow ChatGPT to use agents to access a calculator and perform web searches.

Furthermore, we take advantage of chain-of-thought prompting. Wei et al. (2022) demonstrate that large language models can construct chain-of-thought responses when given examples of chain-of-thought reasoning in the prompt. Chain-of-thought reasoning can be thought of as decomposing a larger problem into intermediate steps to arrive at the final answer. This is also called reasoning. ReAct is an abbreviation for the combination of reasoning and acting. Appendix 1 shows an example of ReAct prompt with the outcome. As illustrated, the model states the steps that are needed to solve the problem (reasoning) and uses search and calculator to get the information needed to solve the problem (acting). In the example, the model looks up the current dollar to euro exchange rate through a web search and uses the calculator to compute the final answer.Footnote 13

Since decision-making and reasoning are built into a large language model, ReAct has several features that make it stand out. First, creating ReAct prompts is simple as users can simply enter their thoughts on top of their queries. Second, ReAct works for a variety of activities with various actions and reasoning requirements, including using a calculator, fact verification, executing code, online search. Third, Yao et al. (2023) find that ReAct regularly outperforms baselines with only reasoning or acting across diverse domains. Lastly, and most importantly, ReAct offers an interpretable sequential decision-making and reasoning process in which users may readily evaluate reasoning and factual accuracy (Yao et al. 2023). In this way, it provides insight into how it solves a problem.

Each time we test a set of questions, we perform it in a different session, meaning the model will not consider any previously entered questions. Table 1 shows descriptive statistics of the number of questions we use for each testing phase. The sample sizes differ by exam because of the number of different questions in the review material.Footnote 14 The sample sizes differs as we add complexity because the cost of running the more advanced models increases. Given our sample sizes are all above 150 for each section of each exam, this choice is unlikely to bias our results. We also list in the table the minimum score necessary to pass each exam. The notes to the table contain descriptions of how we reached these minimums for tests that do not have a hard-set threshold.

Table 1 Descriptive statistics

4 Results

We start our analysis by examining the performance of the ChatGPT 3.5 model. Table 2 contains the results for using the 3.5 model for each section of each exam. The results suggest that scores range from a low of 37.3% for the individual portion of the EA exam to a high of 68.0% for Part 3 of the CIA exam. None of these scores are above the threshold necessary to pass a section of the exam. The overall average of these scores resembles the average score on accounting assessments observed by Wood et al. (2023): the average for certification exams is 53.1%, and the average for accounting assessments was 56.5% (see their Table 4). Also, as in their results, GPT 3.5 struggles most with tax questions and does better with auditing questions.

Table 2 Model Performance for ChatGPT 3.5 and 4 with zero-shot training

Table 2 also presents the results when we use the GPT 4 model. With this newer model, performance improves substantially, ranging in improvements from 9.2 to 24.7% with an average improvement per exam section of 16.5%. Table 2 shows that based on this higher performance, the GPT model passes 5 sections of exams, including all the sections of the CIA exam. Still, the model does not fully pass any of the other certifications.

Table 3 repeats the GPT 4 results from Table 2 in the column labeled “Zero-Shot,” meaning this column shows performance of GPT 4 without any training. Table 3 adds the new column of “10-Shot” that shows how the GPT 4 model performs when it is prompted with 10 examples. The results show an additional average improvement of 6.6% to the model performance. With this improvement, the model can now pass both sections of the CMA exam.

Table 3 Model performance for ChatGPT 4 with 10-shot training

Table 4 repeats the “10-Shot” column from Table 3, labeled as “No ReAct.” This table then adds the ability to reason and perform actions (ReAct) to the GPT 4 model. With this new ability, the model shows an additional improvement of 8.9%. Importantly, the model can now pass all sections of each exam. One major reason ReAct improves performance so much is that the model can now use a calculator. Failure with calculations is a major reason why ChatGPT struggled in financial and tax areas (Wood et al. 2023).

Table 4 Model performance for ChatGPT 4 with 10-shot training and ReAct

We present a visual summary of our results in Fig. 2. Figure 2 shows the performance of the ChatGPT 3.5 model and then adds each additional step. The visual clearly shows that the improved models can easily clear the threshold for each certification exam.

Fig. 2
figure 2

Model performance improvement. Performance improvement in each section of each exam

4.1 Additional analyses

The CPA exam training material separates problems into two categories, application and remembering and understanding. To show how each step in the model process improves the overall performance in each of these categories, we tabulate how each model we previously tested performs on these two types of questions. As shown in Table 5, the performance improvements of using ChatGPT 4 and adding few-shot training have similar effects on application questions as on remembering and understanding questions. In contrast, adding the ReAct abilities to the model has a much more pronounced effect on application questions. This is consistent with the results of Yao et al. (2023), who find that reasoning and acting substantially improves model’s ability to answer more complex questions.

Table 5 Model performance improvement for CPA question types

We provide an additional sensitivity training in Appendix 2 about the optimal level for the number of training shots. This test can only be performed on older models. Our findings suggest that training of 3,000 to 4,000 examples should further enhance performance by around 6%. Professionals wanting to implement ChatGPT in practice should consider using more training to further enhance performance. Additional training beyond this threshold can hurt model performance.

5 Discussion of possible ChatGPT disruption in accounting

Technological development is a process of continuous evolution, characterized by successes, disappointments, and improvements. While new technologies may initially face skepticism and fail to meet expectations, they typically become more reliable and effective over time. This progress is driven by iterative innovation, where developers learn from experiences, and societal adaptation, as users discover new applications. In the long run, this process often leads to technologies becoming better and more integrated into our lives, despite the challenges they may face in their early stages.

While skepticism is a healthy part of any decision-making process, it is important to balance it with openness to innovation. Staying stuck in skepticism about emerging technologies may lead to missed opportunities. These could include benefits, such as increased efficiency, cost savings, competitive advantage, or even the chance to pioneer a new field.

In our opinion, one key difference between a technology being overhyped or not is the degree to which users and prognosticators talk about what will be possible with the technology versus discussing what is currently being done with it. The more language about possibilities rather than realities suggests the technology is likely to be overhyped. So, in addition to the certification exam evidence, we discuss ways in which ChatGPT technology is being used in accounting and auditing, including accounting education, as of November 2023.Footnote 15 In each section, we also discuss challenges and future possibilities that exist because of the emergence of generative AI.

5.1 Generative AI in accounting education

Generative AI is already proving to have dramatic effects in education. For instance, at one of the author’s institutions, an introductory information systems course created a chatbot based on the class’s textbook and other materials (e.g., syllabus).Footnote 16 This class serves many students and as such employs 30 teaching assistants (TAs), who can answer questions in a virtual lab. In half a semester, the students in the class had approximately 51,000 interactions with the “TA-bot,” compared to only 108 interactions with TAs in the virtual lab. The chatbot took less than 20 min to build and costs $200 a month to run. The professor estimates it is more than 95% accurate in responding to queries. In comparison, the TAs cost approximately $22,500 a month and are more than 95% accurate in their responses. In this situation, it appears that students significantly prefer using a chatbot, the chatbot produces similar high-quality answers, and it can drastically reduce costs.Footnote 17

In a similar vein, the textbook publisher Pearson has announced plans to introduce generative AI into its online textbooks (Hughes-Morgan 2023). One of the authors has experimented with this technology, and it allows the learner to ask any question about material in the textbook (e.g., “summarize the main points of this chapter in five bullet points,” “explain concept ‘x’ in simpler terms”) and to generate practice questions to test self-mastery (e.g., “create multiple choice questions to test me on the keywords from this chapter”). The chatbot for the introductory class and the Pearson textbooks is less likely to hallucinate than publicly available generative AI models, like ChatGPT and Gemini, because the responses are constrained to only use the text provided to the model.Footnote 18

Some accounting educators are making significant use of ChatGPT to produce content. For example, ChatGPT (or related technologies) was used to create datasets, accounting scenarios, images, and solution guides for accounting cases and homework problems. As one particular example, the EY ARC cybersecurity accounting case Digital Dungeons is an escape room where students must figure out a numeric code to answer the case. To see whether they are right, the developers used ChatGPT to create the HTML code for a website. With just plain-language prompts, the website incorporates graphics (which were developed using AI) and submission forms and buttons. Furthermore, ChatGPT could encrypt the answer in the HTML code and add a submission delay so that each incorrect submission resulted in the user waiting an additional second before being able to try again. All of this was programmed in less than 30 min.

As another example, the TechHub.training website provides visitors with challenges to enhance their digital literacy (Wood et al. 2023). Student authors write and review all the case materials but use ChatGPT to enhance their work. Similar to the EY ARC case, data, solutions, case descriptions, etc., were developed, refined, or improved by using ChatGPT. Students report that ChatGPT significantly enhanced the quality of their work and the efficiency in producing it.

Professors are experimenting with using generative AI to provide feedback and to grade student submissions (Pinto et al. 2023; Chen et al. 2024; Jukiewicz 2023). The results, to date, are mixed, and additional work is needed to understand both how and when generative AI can help faculty grade. However, the possibility of using generative AI to grade unstructured submissions (e.g., essays) would make it possible to better align assessment with learning objectives, rather than using less effective testing because of limitations in faculty time (Kuechler and Simkin 2005).

Generative AI is also shaping the production of academic research. Vakilzadeh and Wood (2023) have created a beta-version of a tool to help automate literature reviews. The tool allows authors to use generative AI to understand and synthesize research. The tool can be used, among other things, to generate the first draft of a literature review, identify conflicts or gaps in research understanding, and brainstorm research questions to address. The tool has already helped draft literature reviews for papers, succinctly summarize papers for reviews and promotion and tenure packets, and interpret academic research for business professionals.

Indeed, ChatGPT has the potential to revolutionize some existing research methodologies. Consider qualitative research, which collects significant written material from interviews, surveys, or other data collection means. Qualitative scholars must spend significant time reviewing and coding data. Generative AI tools may be able to better perform some of these tasks. For example, Zhang et al. (2023b) could bolster thematic analysis by using ChatGPT, finding that “[large language models] (such as ChatGPT) can conduct qualitative analysis on corpora through well designed prompts, addressing concerns of human analysts” (p. 22). These same authors then develop a tool that “not only refines the qualitative analysis process but also elevates its transparency, credibility, and accessibility” (Zhang et al. 2023a; p. 1). Even if generative AI proves to be less effective than humans at qualitative research, providing the corpora of data from a qualitative research project for other scholars to examine using generative AI can significantly increase the impact the collected data can have.Footnote 19 Certainly, more research is needed on the positive and negatives of using generative AI for qualitative research, but the potential of these tools is significant.

Additional academic tasks ChatGPT enhances include the production of research proposals (Chen et al. 2024), copyediting manuscripts and textbook materials, translating materials to foreign languages, writing emails, brainstorming ideas, finding relevant research (especially when using ChatGPT internet plugins or ChatGPT through the Bing search engine), producing presentations, and summarizing research papers.Footnote 20 As authors, we use this technology on a daily or near-daily basis in these and other tasks.

We do acknowledge that ChatGPT does have problems. ChatGPT, like humans, can hallucinate. ChatGPT is best thought of as a very good, though imperfect, assistant. Designing how AI should work with humans, including the appropriate review processes, will be important for future research. (See additional discussion by Huang and Vasarhelyi 2019.)

So what will the future hold in education and scholarship in a generative AI world? We highlight a few potential ideas for how things may change. In terms of scholarship, the journey of publishing the Wood et al. (2023) manuscript is illustrative. That paper began about two weeks after the release of ChatGPT 3.5 to the public (i.e., mid December 2022). Final notice of the acceptance of the paper was received on March 15, 2023—meaning from initial idea to final acceptance took only three months. Yet, the day before final acceptance, ChatGPT 4 was released. ChatGPT 4 substantially improved upon the ChatGPT 3.5 model such that the basic results of Wood et al. (2023) showing students outperformed the generative AI model were put in serious doubt.

Given that pace of change in the AI sector is so fast, how will academic scholarship keep pace using our current knowledge production and reviewing model? At least in accounting, we are unaware of a paper that was produced, reviewed, and accepted as fast as the Wood et al. (2023) study, and, even so, that paper was somewhat obsolete upon acceptance. If accounting scholars are going to contribute research findings to guide cutting-edge technology or other fast-paced changes, the model for producing accounting scholarship will likely need updating and improvement.Footnote 21

In the classroom, the ability to provide mass customized education is now closer to reality. Generative AI can adapt learning materials to the interests of individual students and can help guide students to better self-diagnose their understanding and then cater materials to their continual development. The divide in performance between students who want to learn and to excel compared to those who are just checking a box will likely grow. Generative AI will enable dedicated students to advance faster and achieve mastery sooner, while students who are just getting by will be more likely to cheat and over-rely on technology to the detriment of their longer-term learning and progression.

Another change in education will be that faculty will increasingly be more a “guide-at-the-side” to students rather than a “sage-on-the-stage,” meaning that professors will have to focus more on guiding learners to self-teach and explore rather than having all the answers and just sharing them with students via lectures. The amount of knowledge that is now even more easily accessible via generative AI chatbots will decrease the need for faculty who just know a lot and increase the need for faculty who can help others learn how to teach themselves.

5.2 Generative AI in accounting and business

There appears to be significant use of ChatGPT by employees. A survey by 11,793 professionals using the networking app Fishbowl finds that 43% of respondents indicate using ChatGPT at work and 68% haven’t disclosed the use of it to their boss.Footnote 22 The current use of ChatGPT in business runs from the very basic to complex. Several basic uses of ChatGPT in business include using ChatGPT to generate basic emails, using it to translate emails for multinational corporations (Emett et al. 2023a), and using it to “to quickly write reports and prepare compliance documents, analyze and evaluate business strategies, [and] identify inefficiencies in operations or create marketing materials and sales campaigns” (Loten 2023). EY reports that board members are using “generative AI in real time during board meetings as an additional input to brainstorm counterpoints, tweak scenario planning and summarize trends. As one director put it, ‘We can use AI almost like a copilot’” (Kanazawa et al. 2023). A survey conducted by KPMG of 2010 companies with more than $1 billion in revenue and 500 or more employees finds that 65% are already using AI in financial reporting and 48% have deployed or are piloting generative AI in their organizations.Footnote 23

In terms of more sophisticated use, the large accounting firms are starting to develop their own generative AI models. PwC reports entering into a global partnership with AI startup Harvey, backed by the OpenAI Startup Fund, to provide its legal business solutions professionals with exclusive access to Harvey’s AI platform, which uses natural language processing, machine learning, and data analytics to enhance legal work (PWC 2023b; O’Dwyer et al. 2023). The platform will be used to support PwC’s global clients, enhancing the ability of PwC’s network of legal professionals to deliver solutions in areas such as contract analysis, regulatory compliance, and due diligence. For several years, EY has been using OpenAI’s GPT engine to develop its own applications. One of their creations is an AI-driven document reader and classification system, which they use for categorizing receipts and tax-related considerations, demonstrating their incremental approach to the technology’s application (Wilkinson 2023).

EY is using ChatGPT in Azure OpenAI to innovate its payroll services as part of its Next Gen Payroll Platform. It has developed a prototype for a payroll chatbot that can handle complex employee queries using a large language model to analyze extensive compliance data. The EY Intelligent Payroll Chatbot is designed to reduce employers’ workload by over 50% by answering intricate payroll questions and offering a personalized employee experience. It can understand the specifics of an individual’s pay slip and link regulatory compliance with company policies for detailed responses and personalized explanations (EY 2023).

Bloomberg has developed a new large-scale generative AI model called BloombergGPT. This large language model is trained on a wide range of financial data to support various natural language processing (NLP) tasks within the financial industry (Wu et al. 2023). BloombergGPT is designed to improve financial NLP tasks, such as sentiment analysis, named entity recognition, news classification, and question answering. It will also unlock new opportunities for using the vast quantities of data available on the Bloomberg Terminal to better serve the firm’s customers (Haas and Gilmore 2023).

Emett et al. (2023a) report that Uniper, an international energy company, is using ChatGPT in the internal audit function, testing its use in audit preparation, fieldwork, and audit reporting. Initial reports suggest efficiency gains ranging from 50 to 80%.

This discussion could continue with the many creative and innovative ways that companies are using generative AI. Indeed, OpenAI reports that more than 90% of Fortune 500 companies are building tools on its platform.Footnote 24 If we step back, what are the larger takeaways that we are seeing for the effects of generative AI on accounting? So far, we have not seen evidence that generative AI results in accounting job loss; however, survey evidence suggests that 26% of employers are considering reducing headcount because of implementation of ChatGPT.Footnote 25

Emett et al. (2023b) find that board members, senior management, and heads of internal audit agree that any savings in assurance work from automation (of any kind) will not be redeployed into increasing the amount of assurance but rather allocated to non-assurance (i.e., consulting) activities. This suggests that accounting firms are likely to see profitability erosion from AI in their audit work and continued growth in providing non-audit services (see Fedyk et al. 2022 for pre-ChatGPT AI evidence on fees). Our discussions with accounting professionals suggest accounting partners are considering whether AI can replace offshoring work as a first area to automate.

Note that ChatGPT deployment is still very modest in accounting, especially at smaller accounting firms. Recently, one of the authors spoke with nine managing partners for regional accounting firms. These firms are just starting to learn about ChatGPT and consider how to use it. While large firms have billions of dollars to invest in these technologies, smaller ones do not and implementing generative AI may not be immediately feasible. This could result in an increased gap between services offered by large and small accounting firms and the necessity for companies that are using technology to have to work with large accounting firms that can understand AI technologies.

In our opinion, it is clear that generative AI is already starting to impact the accounting and business fields. The research evidence and our experiences suggest that generative AI is not just hype but is already being used in substantive ways. The question is not whether generative AI will influence accounting but how much. Our early observations are that generative AI may not be hyped enough for its potential to change the accounting industry.

However, while generative AI in auditing and accounting promises numerous benefits, it may also bring challenges. For example, professionals might depend too much on AI, leading to a decline in essential skills and judgment. Data privacy and security are major concerns, given the sensitive nature of financial information handled in these fields. The accuracy and reliability of LLMs, particularly in complex scenarios, are not foolproof, posing risks in decision-making. Additionally, there are ethical and compliance issues since, in their current form, LLMs may not fully align with the strict standards of the accounting profession.

The potential for job displacement due to automation, especially in routine task areas, raises socioeconomic concerns. The cost of implementing and maintaining AI systems can be high, potentially excluding smaller firms from leveraging these technologies. Training and adaptation for current professionals represent another layer of challenge. Furthermore, biases inherent in AI algorithms and the lag in regulatory frameworks adapting to these advancements present risks that cannot be ignored.

6 Conclusion

Technological advancements continue to have a significant impact on business and accounting (Masli et al. 2011; Moffitt et al. 2016; Austin et al. 2021; Richardson and Watson 2021; Eulerich et al. 2023a, b, c). The most recent advancements in AI, large language model chatbots, will likely continue this trend. The degree to which they will impact accounting depends on their ability to perform accounting tasks at a high level. We test this ability by seeing how well one of these chatbots can perform on accounting certification examinations.

The results of our study demonstrate that ChatGPT can perform sufficiently well to pass important accounting certifications. This calls into question some of the competitive advantages of the human accountant relative to the machine. To our knowledge, for the first time, AI has performed as well as the majority of human accountants on real-world accounting tasks. This raises important questions about how the machine and accountant will cooperate in the future. We encourage research to help understand where machine and human abilities are best deployed in accounting. We also encourage research that advances the capabilities for machines to perform more accounting work—freeing accountants to innovate and add greater value to their organizations and society. Footnote 26

We make several additional suggestions for future research. We equipped ChatGPT with a calculator for computation tasks and a search engine to find out more about the topics in the questions. However, we observed that the search agent does not always provide useful information to ChatGPT. Future research can investigate whether agents that retrieve information from reliable and more specialized resources improve model’s performance. An agent, for example, can be programmed to retrieve information from the PCAOB audit standards or the IRS tax publications. Similarly, researchers might study whether human feedback as an intermediary step of the chain-of-thought can improve the performance of the model.

Another area for future research is AI transparency. We find that using ReAct substantially enhances transparency about ChatGPT’s decision-making. We note that the model is more likely to explicitly state that it is unsure of what to do or is making a guess to answer. Similarly, under these settings, the model is more likely to respond “I don’t know.” Although not empirically tested, we anecdotally note that ChatGPT tends to hallucinate less when we use ReAct. Future research can investigate ways through which transparency can be enhanced and whether making ChatGPT an agent improves accuracy and reduces hallucinations.

Considering that AI deployment in accounting is already happening, there is also a need for research in auditing AI. Research on AI auditing has focused on evaluating whether specific applications meet predefined industry requirements. For instance, researchers have created procedures for auditing AI systems used in recruitment (Kazim et al. 2021), online search (Robertson et al. 2018), and medical diagnostics (Liu et al. 2022). As AI becomes more prevalent in corporate operations, AI auditing from a corporate governance perspective becomes even more important. While some studies propose frameworks from a governance perspective (e.g., Mökander et al. 2023), auditing AI remains an important, underexplored area for future research.

Our study is subject to several limitations. First, it omits from testing questions that require greater cognitive ability, such as interpreting situations and contexts and interpreting visualizations. Future studies should continue to probe how AI and related technologies can perform these more advanced functions. Second, we test practice exams rather than actual exams, as the actual exams are not available. Third, although our results suggest ChatGPT can respond to questions, we do not test whether it can perform actual accounting tasks, such as bank reconciliations, tax preparation, closing the books, etc. We encourage research that can demonstrate whether AI can move from knowing to doing. ChatGPT and related technologies are exciting modern technologies. We encourage their continued study and implementation in practice.