Introduction

Global environmental assessments (GEAs) such as those developed by the Intergovernmental Panel on Climate Change (IPCC), the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES), the Global Environmental Outlook (GEO), the Global Sustainable Development Report, and the UNEP Emissions Gap and Adaptation Gap reports are crucial for providing information, fostering a better understanding of the causes and consequences of natural and human activities on the environment, and developing action-oriented and effective solutions that promote planetary health and wellbeing (Castree et al. 2020). In the past, these assessments have informed key policy decisions and helped to identify gaps in current environmental governance frameworks (Jabbour and Flachsland 2017). The IPCC, for example, has a long history of informing national and international policy and negotiations, for instance through the United Nations Framework Convention on Climate Change (UNFCCC). The panel’s recommendations have underpinned the development of national and international climate policies, including the Paris Agreement (Ourbak and Tubiana 2017). More recently, the reports of the 6th IPCC assessment cycle (hereafter IPCC AR6) have played a crucial role in raising public awareness about the urgent need to address climate change (IPCC 2023) and in informing the Global Stocktake on achieving mitigation, adaptation and means-of-implementation targets.

Several concerns have nevertheless been raised about GEAs (Vardy et al. 2017; Castree et al. 2020), not least the exponentially growing body of evidence from the peer-reviewed and grey literature (Stocker and Plattner 2014; Berrang-Ford et al. 2021a; Palutikof et al. 2023). A bibliometric analysis of the Web of Science has shown that the number of unique articles on climate change grew from almost 40,000 between 2010 and 2015 to more than 57,000 between 2016 and 2021, with a clear shift over this period from the physical science of climate change to the topics of impacts, mitigation and adaptation (Khojasteh et al. 2024). Consequently, the number of references used in each IPCC assessment cycle has also rapidly increased: in the 5th assessment cycle, the IPCC authors assessed approximately 30,000 publications, whereas in the 6th assessment more than 66,000 publications were assessed. Not only has the number of references increased; over the years, the scope of GEAs has also become more diverse. Today’s environmental assessments draw on a much wider range of fields, including the natural and social sciences, economics, anthropology, psychology, engineering and the humanities (Callaghan et al. 2020). Similarly, the readership of these reports has grown, from mostly negotiators and national governments in early assessments to a range of (non-)governmental organisations, civil society actors and businesses across the globe.

Given the growth in environmental literature and the expanding scope and readership of GEAs, traditional methods of synthesizing work are increasingly challenging.

Artificial intelligence-assisted synthesis work

Assessing the evidence from many diverse sources and disciplines is an incredibly demanding task. The authors need to search, collect and systematically assess the literature to synthesise evidence and extract coherent narratives that can be traced back to their original sources (Callaghan et al. 2020; Berrang-Ford et al. 2021a). At the same time, machine learning (ML)-assisted research synthesis has grown in popularity amongst the environmental research community (Callaghan et al. 2020; Berrang-Ford et al. 2021a; Lydiri et al. 2022; Sietsma et al. 2024a). Advances in natural language processing (NLP) have led to pre-trained large language models (LLMs), which are currently standard in the field (Brown et al. 2020).

In the climate domain, studies have used transformer-based LLMs from the BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) families for analysing corporate climate disclosure (Bingler et al. 2022) and for climate impact attribution (Callaghan et al. 2021). Machine learning methods have also supported the IPCC AR6 cycle. For example, supervised and unsupervised learning techniques have been used to assess the evidence of human adaptation to climate change (Berrang-Ford et al. 2021a), the extent of health-related climate change topics (Berrang-Ford et al. 2021b) and the existence of adaptation limits (Thomas et al. 2021). Similarly, these methods allow for multilingual assessments and automated translation of texts, addressing a frequently voiced critique of global assessments, namely the reporting bias towards English-language evidence (Sietsma et al. 2023). For example, machine translation (MT) can be integrated at the system level to provide multi-language support: non-English queries are first translated into English so that the LLM engine can produce an English response, which is then translated back into the original language (Thulke et al. 2024); a minimal sketch of this wrapper is given at the end of this subsection.

Artificial intelligence (AI)-supported screening of the rapidly increasing body of literature can save significant time and scarce resources (van de Schoot et al. 2021) and enable “living evidence synthesis platforms” that continuously update the expanding pool of scientific literature to be included in the assessment (Elliott et al. 2021). One way to do this is to use AI Research Assistant tools to survey the literature prior to evidence synthesis, which allows authors to find papers, extract data and organise them by concept (De-Gol et al. 2023). At a later stage, AI tools can assist authors in determining the level of confidence in their statements by characterising the level of agreement and the level of evidence more systematically and transparently (Mastrandrea et al. 2011). To pursue this, authors can adopt a fact-checking approach (Leippold et al. 2024). This architecture enables LLMs to proficiently integrate diverse scientific evidence and iteratively use a mediator-advocate framework to converge towards a final assessment based on multiple lines of evidence. This mirrors, to some extent, the human expert judgement elicitation that is often at the core of such assessments (Majszak and Jebeile 2023). Fact-checking can also go a step further and be used to check for consistency across different elements of a report, for example amongst chapters, and to illuminate potential knowledge gaps.

These advances in AI have led to vastly increased usage of text-as-data methods in the context of environmental and climate change research and beyond, which has so far received only limited attention (Stede and Patz 2021). In the next sections, we critically discuss the capabilities, opportunities and usefulness of question answering (QA) tasks following the emergence of tools deployed in 2023, primarily ChatClimate (Vaghefi et al. 2023), which was developed by a group of authors of this paper, but for completeness also ClimateQA (Lelong et al. 2023) and ClimateGPT (Thulke et al. 2024).
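As referenced above, the translate-in/translate-out pattern can be illustrated with a minimal Python sketch. All functions here are hypothetical stand-ins (a real system would call an MT service and an LLM-based QA backend); this illustrates the system-level workflow described by Thulke et al. (2024), not their implementation.

```python
def detect_language(text: str) -> str:
    """Hypothetical language detector; a real system would use an MT service."""
    return "fr" if "é" in text else "en"

def translate(text: str, source: str, target: str) -> str:
    """Hypothetical translation call; stands in for an MT API."""
    return f"[{source}->{target}] {text}"

def answer_in_english(question: str) -> str:
    """Hypothetical English-only QA backend (e.g. a retrieval-augmented LLM)."""
    return f"Answer to: {question}"

def multilingual_answer(query: str) -> str:
    """Translate-in / translate-out wrapper around an English-only engine."""
    lang = detect_language(query)
    if lang != "en":
        query = translate(query, source=lang, target="en")
    answer = answer_in_english(query)  # LLM produces an English response
    return translate(answer, source="en", target=lang) if lang != "en" else answer

print(multilingual_answer("Quelle est l'ampleur du réchauffement observé ?"))
```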

Leveraging the benefits

Question answering (QA) involves using natural language processing (NLP) techniques to develop models that can understand and respond to questions posed in natural language (Van Dis et al. 2023). These models are trained on large datasets and can identify and extract relevant information to answer a user’s question. Powerful LLMs trained for QA can comprehend complex questions, interpret their underlying meaning and context, and apply the knowledge they have acquired during training to generate accurate and informative answers to a query. LLMs can be further extended to chatbots, offering users an interactive and intelligent dialogue in a contextually relevant manner (Stokel-Walker and Van Noorden 2023). They are estimated to have considerable economic, social and policy implications (Eloundou et al. 2023). However, they are subject to two major challenges: hallucination (i.e. the generation of text that is not grounded in factual information) and reliance on information that dates only to the time of training (Leeming 2023). In domains like global change, getting accurate, up-to-date information from trustworthy sources is essential. A possible solution has been to (i) give LLMs access to additional, scientifically acknowledged resources to keep their knowledge up to date and avoid spreading disproven, outdated or misleading information, and (ii) ask a wide range of experts to check the reliability of answers (Gao et al. 2023).

Following this rationale, domain-specific chatbots like ChatClimate, ClimateQA and ClimateGPT have been deployed to enhance LLMs through the integration of information from databases of relevant documents, namely the IPCC Assessment Reports, the IPBES assessments and documents of well-established organisations such as the World Meteorological Organisation (WMO) (Vaghefi et al. 2023). These chatbots can be utilised in a workflow designed jointly with climate researchers, streamlining the process of adding literature to databases and accelerating literature search, collection and cross-referencing efforts. This is illustrated in Fig. 1 for the specific case of ChatClimate. A detailed technical description of the chatbot is beyond the scope of this paper; the reader is referred to the overview by Vaghefi et al. (2023).

Fig. 1

QA-Experts workflow. The QA-Experts workflow starts with experts adding extra (high-quality and trusted) knowledge, such as IPCC and/or other GEA reports and peer-reviewed articles, to the database (left-hand side of the panel). Experts then formulate appropriate questions from the body of literature and feed them to the chatbots (left-hand side of the panel). Based on a semantic search, texts related to each question are retrieved through a contextual compression system (middle and centre-right of the panel). The experts finally check the answers and the references within (bottom right) (adapted from Vaghefi et al. 2023)
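To make the semantic-search step in Fig. 1 concrete, the following is a minimal sketch of retrieval over a toy corpus, assuming the sentence-transformers library and a generic embedding model; it illustrates the principle only and is not ChatClimate’s actual retrieval or contextual compression code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative passages standing in for a curated GEA corpus.
passages = [
    "Global surface temperature was 1.1°C above 1850-1900 in 2011-2020.",
    "Limiting warming to 1.5°C requires rapid and deep emission reductions.",
    "Sea level rise is accelerating and will continue for centuries.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = model.encode(passages, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question (cosine similarity)."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity on normalised vectors
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved context is then placed in the prompt handed to the LLM.
context = "\n".join(retrieve("How much has the planet warmed so far?"))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n"
```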

To ensure that the scientific community remains effective in synthesising and timely conveying knowledge, leveraging these QA capabilities of chatbots built upon LLMs in support of authors and report beneficiaries can be particularly useful. For authors, this helps with reports that rely on very diverse literature stemming from various disciplines. For users of the reports, these systems can extract and process information tailored to their questions. Chatbot tools based on curated corpora can be both innovative and powerful with respect to the types of questions end-users of reports want answered, allowing reports to be scoped in a way that better addresses the needs not only of decision-makers but also of other end-users.

As an illustrative example, we perform a semantic analysis of the three assessment reports of the IPCC AR6 (ESM Fig. 1, left panel) and of the questions asked to ChatClimate between April and November 2024 (ESM Fig. 1, right panel). By means of word clouds, we can see that there is reasonable agreement between the most common words in the reports and in the questions asked about them. For example, in both cases, the words “risk”, “emission” and “warming” are present with similar frequency. Notwithstanding, the words “impacts” and “adaptation” are hardly visible in the word cloud of questions despite being very prevalent in the IPCC reports. A fairly simple analysis of the questions across all tools can help steer, for example, post-publication outreach materials, targeting the broader community and customising content for different user categories (a sketch of such a frequency analysis is given below). A community of users can be created which interacts with authors through specific chatbots to elicit aspects of the report, in the full spirit of knowledge co-creation (Mauser et al. 2013). Additionally, LLMs can be used for fact-checking to counteract the spread of misinformation, i.e. to check the validity of claims made through various media and social media platforms (Chavalarias 2022; Leippold et al. 2024; Schimanski et al. 2024).
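A minimal sketch of the frequency comparison underlying such word clouds follows; the input file names are hypothetical and the stopword list is truncated for brevity.

```python
from collections import Counter
import re

STOPWORDS = {"the", "and", "that", "with", "this", "from", "have", "will"}

def term_frequencies(text: str) -> Counter:
    """Lowercase, tokenise and count words, dropping stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 3)

report_counts = term_frequencies(open("ipcc_ar6_text.txt").read())     # hypothetical file
question_counts = term_frequencies(open("user_questions.txt").read())  # hypothetical file

# Terms prominent in the reports but rare in user questions hint at
# outreach gaps (e.g. "adaptation" in the analysis above).
for term, n in report_counts.most_common(20):
    print(f"{term:15s} report={n:6d} questions={question_counts.get(term, 0):6d}")
```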

Testing response accuracy

In this section, we demonstrate the performance and limitations of chatbots through practical experiments, utilising the IPCC reports as a testbed for these technologies. These experiments can of course also be adapted to other GEAs.

In a first experiment, three different instructions are provided to ChatClimate on how to answer a query. As outlined in the previous section, a crucial step is the integration of domain-specific material on top of the general, large quantity of text data that is provided to LLMs during training (Fig. 1). As can be seen in Table 1, the same question is asked to the underlying LLM (GPT-4), but only in the hybrid and standalone cases is the model fed with additional information from the IPCC reports. Furthermore, prompt engineering is used in the standalone case to force the model to base its answers only on the IPCC reports. The answer returned using only GPT-4 is seemingly correct, but not necessarily focused, and lacks the nuances and details that are more discernible in the other two cases. Furthermore, the use of in-text citations, with the possibility to retrieve the page number, in the hybrid and standalone models is certainly an advantage, as it allows statements to be traced back to the source of information.

Table 1 Comparison of the three model setups: GPT-4 using only knowledge from its training material (GPT-4); GPT-4 instructed to use the IPCC reports on top of its in-house knowledge (GPT-4 + ChatClimate); and GPT-4 instructed to use only the IPCC reports to answer the query (Standalone ChatClimate)
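Schematically, the three setups in Table 1 differ only in the prompt handed to the underlying model. The wording below is illustrative, not the exact prompts used by ChatClimate, and the question and excerpt placeholder are hypothetical.

```python
QUESTION = "Is it still possible to limit global warming to 1.5°C?"
IPCC_EXCERPTS = "..."  # retrieved report passages would be inserted here

setups = {
    # (1) GPT-4 alone: the model answers from its training data only.
    "GPT-4": QUESTION,
    # (2) Hybrid: IPCC excerpts supplement the model's in-house knowledge.
    "GPT-4 + ChatClimate": (
        "Use the IPCC excerpts below where relevant, citing page numbers, "
        "and complement them with your own knowledge.\n"
        f"Excerpts:\n{IPCC_EXCERPTS}\n\nQuestion: {QUESTION}"
    ),
    # (3) Standalone: prompt engineering restricts answers to the excerpts.
    "Standalone ChatClimate": (
        "Answer ONLY from the IPCC excerpts below, citing page numbers; "
        "if they do not contain the answer, say so.\n"
        f"Excerpts:\n{IPCC_EXCERPTS}\n\nQuestion: {QUESTION}"
    ),
}

for name, prompt in setups.items():
    print(f"--- {name} ---\n{prompt}\n")
```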

The second experiment focuses on the ability of LLMs to capture nuanced concepts such as accelerating climate change and tipping points. In this experiment, ChatClimate was first given only the AR6 Synthesis Report (IPCC AR6 SYR) to formulate its answer, and then given all underlying IPCC Sixth Assessment reports excluding the IPCC AR6 SYR to answer the same question. In the supplementary material, ESM Table 1 shows the answers of ChatClimate when using only the IPCC AR6 SYR versus the case where all reports except the IPCC AR6 SYR are included. The answers to the question on acceleration are evaluated by the authors as inaccurate in both cases: ChatClimate appears to confuse acceleration with “rate of change”. An acceleration (a change in the rate of change) is mathematically distinct from the rate itself, and the only conclusion on acceleration in the Synthesis Report concerns accelerating sea level rise. Hence, the limited mathematical characterisation of “acceleration” in the report (beyond the evidence on accelerating sea level rise) might have given rise to the confusion. The answer to the second question, on tipping points, shows that using only the IPCC AR6 SYR leads to a more accurate response than the case where it is excluded. This experiment thus highlights the value of iterative learning when preparing the data source, and the importance of involving the authors who actually produced the data source, particularly when it comes to concepts that have multiple meanings.

In the third and last experiment, the answers provided by the three different tools ChatClimate, ClimateGPT and ClimateQA were compared. Again, we asked a question which requires accuracy: the chatbots are asked how much of climate change is due to fossil fuels. The results are summarised in ESM Table 2. Although the responses differ in style and level of detail, in this case only ClimateQA reflects the full magnitude of CO2 emissions from fossil fuels; ChatClimate focuses only on the magnitude of the past decade.

These three short experiments call for in-depth reflection, evaluation and comparison of the tools by domain experts such as climate scientists. Compiling sensitive and nuanced topics should become a standard procedure in the age of AI, as should comparing how such topics are treated across different elements of the reports. Hence, it is essential that experts who are familiar with the breadth and scope of each specific topic test these tools regularly. In the absence of regular testing and evaluation, well-crafted and convincing answers might hide incorrect or outdated statements (Bender et al. 2021). The process could be improved by performing frequent rounds of response ranking that engage a large pool of researchers covering the full range of scientific views, as well as LLMs serving as advocates (Leippold et al. 2024). Convergence in response rankings could be elicited following a Delphi-style procedure such as the one performed in IPCC AR6 (Zommers et al. 2020); a simple concordance measure that could support such a procedure is sketched at the end of this section. Moreover, the performance of the technology can be improved by using sensibly curated corpora. These corpora can in principle be tailored to specific topics, such as adaptation (Sietsma et al. 2024a, b), where using AI allows small teams of authors (e.g. authors of IPCC special reports or chapter authors) to develop queries tailored to the information they are interested in (De-Gol et al. 2023). In the context of the IPCC, the Technical Support Unit or another initiative (e.g. as was done with the Global Adaptation Mapping Initiative, see Berrang-Ford et al. 2021a) may play a role in this by providing authors with the skillset and resources needed to take full advantage of the AI capability.
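As one illustration, convergence across such ranking rounds could be quantified with Kendall’s coefficient of concordance, a standard agreement measure for multiple rankers. The expert ranks below are hypothetical, and this is an illustrative choice rather than a procedure prescribed by any assessment body.

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for a (raters x items) rank matrix.
    1.0 indicates perfect agreement between raters, 0.0 no agreement."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical example: four experts rank five candidate answers (1 = best).
expert_ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
    [1, 2, 4, 3, 5],
])
print(f"Kendall's W = {kendalls_w(expert_ranks):.2f}")  # high value -> convergence
```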

The way forward in the age of AI

Deploying domain-specific LLMs gives researchers and users of scientific knowledge targeted access to specific information and answers. Using LLMs and specialised chatbots could relieve authors from the lengthy search and meticulous procedure of reading through thousands of papers, leaving more time for synthesis work, which includes compiling and evaluating the evidence (Mach et al. 2017). Using AI tools in the literature selection and review process also minimises potential subjectivity in the assessment, thereby ensuring that these global assessments can continue to cover the full and diverse range of scientific perspectives. We suggest that the scientific community can greatly benefit from the use of AI for better scientific communication, knowledge accessibility and synthesis. In line with open science and open research data principles, researchers from around the world can access and contribute to shared knowledge, promoting a more inclusive and globally connected community. AI has clear potential to improve the efficiency of the process and ultimately enhance the comprehensiveness and usability of the reports.

The careful and labour-intensive process behind Global Environmental Assessments has proven to be both their greatest asset and a barrier to remaining relevant and up to date in a modern age of urgency and thirst for the latest knowledge. The use of AI tools to assist in the review and synthesis of the scientific literature can reduce the burden on experts and expedite the assessment process, thereby paving the way towards more regular release of reports and streamlining the transfer of the latest science into action. So far, we have not addressed the capability of large language models (LLMs) to interpret figures and tables, but this is becoming increasingly feasible. While fully interpreting the complex and rich figures found in GEA reports remains a challenge, the rapid pace of advancement suggests that this capability may become standard practice sooner rather than later (OpenAI 2024).
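As a pointer to what is already possible, the sketch below queries a vision-capable model about a figure via the OpenAI Python client. The model name, figure URL and prompt are assumptions for illustration, and the interface reflects the client library at the time of writing.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model; substitute as appropriate
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarise the main message of this figure from a GEA report."},
            {"type": "image_url",
             "image_url": {"url": "https://example.org/figure.png"}},  # hypothetical URL
        ],
    }],
)
print(response.choices[0].message.content)
```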

Clearly, AI tools are not a substitute for the rigour that only experts can provide. As demonstrated in the previous section, there are nuanced concepts which may be only tangentially covered in GEAs, or for which there is limited evidence and scientific consensus. Periodic evaluation of the tools by a large pool of experts must be encouraged to ensure that answers are anchored in the latest science and to avoid misleading information. Given the rapidity with which AI technologies are developing, it is essential to establish ethical procedures for their use. This involves addressing ethical considerations such as reproducibility and bias mitigation. We recommend the creation of ethics committees and the organisation of dedicated expert meetings. Specifically, we call for the development of guidelines for best practices in integrating AI into global environmental assessments, ensuring that the adoption of these technologies is both responsible and transparent. Additionally, training LLMs on huge amounts of data has a potentially very high carbon footprint, and we have little knowledge about the carbon footprint embedded in LLMs such as GPT-4 (Bender et al. 2021). Inference, i.e. the use of already trained LLMs, also becomes more energy intensive as models grow larger. Therefore, the LLM community itself needs to implement environmentally aware workflows to avoid contributing to the very challenges it claims to tackle (Hershcovich et al. 2022).