1 Introduction

Among the most influential and controversial frontiers of technological innovation, advances in AI-driven large language models (LLMs) are rapidly transforming how expert knowledge is generated and accessed [1, 2]. An LLM-based chatbot is powered by a neural network model that uses natural language processing to enable user interaction through a chat-like interface [3]. By providing personalized and contextualized answers through complex conversations on the web, and drawing on supervised and reinforcement learning techniques, these AI-driven technologies have recently gained immense popularity worldwide [4]. There are various chatbots, including open-source alternatives, such as ChatGPT, Microsoft Copilot, Google Gemini, Anthropic Claude, xAI Grok, LLaMa via Perplexity, and HuggingChat [5, 6]. Among these LLM-based chatbots, ChatGPT from OpenAI is the fastest-growing consumer application in history. ChatGPT evolved into the GPT-4 model in 2023, which is claimed to offer more precise and contextually relevant responses than the earlier GPT-3.5 model [7,8,9]. These advancements rely on technical solutions that train LLMs with more parameters, expand their memory, and integrate access to online information on current events [10]. While chatbot developments expand the capacity to process vast datasets and generate context-specific responses [2], a growing debate questions whether ethical considerations are also being addressed and improved in these chatbot upgrades [1, 11].

Chatbots are not simply objective or neutral AI tools that enhance the efficiency of processing and accessing online information. These AI-driven models directly influence user judgment and decision-making, potentially generating bogus data and biased information that can result in instances of injustice and inequality [3, 12]. Reliance on generative AI models may sideline diverse human voices, centralize the control of information and its benefits, infringe on copyrights and data ownership, and neglect the nuanced viewpoints of multiple stakeholders [3, 13]. As the absence of safeguards and regulations raises critical issues around the responsibilities of digital innovation, it is vital to examine how these growing AI innovations can lead to uncertainties and potentially unethical consequences [12, 14].

This paper considers the ethical implications of chatbots and their use as a source of expertise, based on a critical analysis of evidence generated by GPT to inform ecological restoration practices. As generative AI develops, chatbot-generated information may significantly influence conservation science and policymaking around the globe [15]. Yet the intricate and dynamic nature of ecosystems, encompassing numerous species interactions, situated environmental factors, and multiple knowledge practices, demands a nuanced understanding of diverse AI impacts [16]. Big data-driven knowledge production has recently been questioned when adopted in conservation research and practice, raising critical concerns regarding bias, inaccuracy, transparency, and legitimacy in decision-making processes [17,18,19]. While there has been some attention to the use of chatbots to assist conservation education and research [13], less attention has been paid to the ethical implications of bias in the evidence presented by this AI-driven research assistance.

The inclusion and use of diverse sources of evidence and perspectives is critical to inform effective and just conservation decisions [20]. Diverse methods are now used to elicit expertise, each involving the gathering and negotiation of knowledge practices to support informed conservation targets and strategies [21]. Studies show that multiple evidence-based approaches are pivotal for understanding the multiplicity of risks and options facing decision-makers [22] and for informing conservation options that achieve biodiversity and local community goals [23]. Chatbots have the potential to assist this process by quickly analysing the available online literature and extensive environmental datasets to generate insights for scientific research and policymaking. Although LLM advancement offers opportunities to speed up the elicitation of available information, there is a wide range of risks and uncertainties in applying AI-generated content to review the knowledge needed for complex conservation issues.

In this paper, we analyse the ethical issues around the sources of knowledge and evidence used by ChatGPT to provide information and insights about ecological restoration practices. We analysed 40,000 answers from the GPT-3.5-turbo and GPT-4 models to assess the ethical dimensions of GPT responses in terms of diversity and inclusion of experts, geographic distribution of information sources, stakeholder representation, and data validation. Our assessment demonstrates that while the chatbot model has technically expanded to cover extensive data sources, these AI developments have not necessarily led to improved ethical performance in terms of fair representation and inclusion of diverse expertise and stakeholders.

2 AI-driven assistance in conservation

Conservation science is a field characterized by intricate ecosystems and species interactions, frequently compounded by an urgent need to formulate strategies and policies that address complex environmental change [24]. The process of systematically consulting experts to inform judgments is a critical component of decision-making practices, particularly when the level of uncertainty and risk in these complex contexts is high [20]. Expert elicitation can provide critical insights into how conservation planning, management practices, and interventions can consider ecological dynamics, species biology, and environmental threats to enhance the effectiveness and sustainability of initiatives [25]. The increasing adoption of data-driven tools to gather, produce, and circulate information can directly affect these expert elicitation practices [26, 27]. This is particularly influential in the case of generative AI, where the capacity to assemble big data provides contextualized evidence and delivers quick responses to complex problems. At the same time, the use of LLMs in expert elicitation raises several ethical concerns, including: Whose expertise is considered in the elicitation? How are errors and biases recognised to ensure the reliability of evidence? What measures are in place to enhance transparency regarding sources and methodologies? These growing issues around auto-generated information require a deep investigation into how each step of expert elicitation should question the ethics of using AI-driven tools to inform knowledge production and decision-making processes.

Expert elicitation in conservation science typically follows distinct stages to ensure the relevance and reliability of the information gathered to inform decisions [28] (Fig. 1). These processes can now be combined with chatbot-assisted practices for searching information and automating research tasks across different elicitation activities. A variety of methods are available to negotiate how evidence and information can be considered and applied [28]. This includes methodological approaches to determining the sources and forms of expertise that can be integrated into analytical models, as well as determining whose voices are considered and which forms of knowledge are included or excluded [29]. LLMs can assist this process by providing rapid summaries and insights from relevant literature, clarifying terminologies, incorporating conceptual formulations into analytical models, and building scenarios for suggested conservation decisions or approaches.

In the elicitation process, experts determine the variables to be elicited by identifying the most significant factors affecting decision-making. The goal is to pinpoint and reveal relevant information about areas of parameter uncertainty that hinder effective decision-making [30]. LLMs can support this process by identifying knowledge gaps and framing questions where expert consensus is lacking. This initial scoping dimension of the elicitation raises several ethical issues, including how the selection of specific expert knowledge ensures a diverse and inclusive representation of different perspectives [29].

Designing the elicitation encompasses delineating methods to manage bias, determining the elicitation format, identifying and compiling background materials, testing and finalizing questions, developing scenarios, and defining logistics for interactions with experts [26]. Chatbots can be adopted to tailor information to the specific needs and contexts of projects and interventions. For instance, LLMs can generate hypothetical case studies based on place-based experiences to assist in developing training materials that standardize stakeholders' understanding. LLMs' capability to assemble contextual data can also assist in designing the interviews, surveys, or questionnaires used to elicit expert inputs or opinions. These activities may raise critical ethical concerns involving transparency in how generative AI may inadvertently introduce biases and how accurately chatbot-generated information aligns with the specific needs and contexts of projects and locations.

Performing the elicitation then requires experts to express knowledge in the required quantitative terms; indirect methods instead involve experts answering questions related to their experiences, with their responses subsequently translated into the necessary quantities [31, 32]. Information can be gathered from experts separately or collectively, for instance through a Delphi method [33]. This step may involve statistical methods to consolidate individual responses into a collective output that supports consensus, while documenting the degree of uncertainty and dissent among the experts [34]. At this stage, chatbots can be applied to generate rounds of questions and to help synthesize and summarize feedback across different languages and groups. Ethical considerations include ensuring effective communication and enabling equitable participation in the consensus-building process, particularly where barriers arise from linguistic or cultural differences.
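To make this synthesis role concrete, the sketch below shows how a chatbot could be asked to summarize one round of Delphi feedback. It is a minimal example assuming the OpenAI Python client; the prompt wording and model name are illustrative rather than a prescribed protocol.

```python
# Illustrative sketch of chatbot-assisted synthesis of one Delphi round.
# The prompt wording and model choice are assumptions, not a prescribed
# protocol; outputs would still require review by a human facilitator.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_round(expert_responses):
    """Ask the model to synthesize expert feedback, flagging dissent."""
    joined = "\n\n".join(expert_responses)
    prompt = (
        "Summarize the following expert responses (which may be in "
        "different languages) in English. Identify points of consensus, "
        "points of dissent, and the degree of uncertainty expressed:\n\n"
        + joined
    )
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```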

After these stages, the elicitation is finalized by encoding the information in quantitative statements usable in a model [28]. While existing software tools already automate these reporting tasks [27], chatbots can further assist in capturing the nuances of data analysis and documentation, highlighting common themes, consensus, and divergence in perspectives. The use of AI tools here also raises the issues of ensuring that human expertise is appropriately integrated for diverse interpretations and of establishing accountability for potential impacts on the final outcomes.

Fig. 1 Ethical formulations surrounding the adoption of chatbots in expert elicitation, including key ethical considerations (arrows) from planning to reporting stages of knowledge production and decision-making processes

3 Research methodology

In this paper, the research methodology includes dataset collection from the GPT-3.5-turbo and GPT-4 models, followed by analysis of the diversity and inclusion [35, 36] of experts, the geographic distribution of information sources, stakeholder representation, and data accuracy in conservation.

3.1 Dataset collection

We developed a questionnaire of 20 questions regarding expertise, information sources, and stakeholders (Supplementary Table 1) [37]. We collected the answers from the ChatGPT API between June and August 2023 using the ‘GPT-3.5-Turbo’ and ‘GPT-4’ models. To ensure comprehensive data collection, we asked each question 1,000 times of each model, resulting in a dataset of 40,000 answers (20,000 per model). By obtaining 1,000 responses per query, we aimed to give the models ample opportunity to generate diverse answers, minimizing the risk of overlooking inclusivity and diversity. This approach not only mitigates threats to generalizability but also ensures that the obtained distribution is representative of the specific query at hand. Our first 10 questions, regarding expertise and information sources, covered diverse dimensions of ecological knowledge, including experts (researchers and practitioners), existing relevant literature, influential research centres, relevant restoration projects, restoration actions, knowledge, and experiences. Our next 10 questions, regarding stakeholder participation, covered diverse dimensions of ecological stakeholders, including influential stakeholders and organizations, different forms of participation, policymaking engagements, and successful, innovative, real-world examples of stakeholder engagement in restoration.
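This collection loop can be expressed as a short script. The sketch below assumes the current OpenAI Python client; the question texts, output path, and record layout are illustrative stand-ins for the exact pipeline used in this study.

```python
# Minimal sketch of the repeated-query collection loop (illustrative;
# question text, file path, and record layout are assumptions, not the
# exact configuration used in this study).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo", "gpt-4"]
N_REPEATS = 1000  # 1,000 answers per question per model

questions = [
    "Who are the main experts in ecological restoration and what are "
    "their affiliations and country?",
    # ... the remaining 19 questions from Supplementary Table 1
]

with open("gpt_answers.jsonl", "w") as out:
    for model in MODELS:
        for q_id, question in enumerate(questions, start=1):
            for rep in range(N_REPEATS):
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": question}],
                )
                record = {
                    "model": model,
                    "question_id": q_id,
                    "repeat": rep,
                    "answer": response.choices[0].message.content,
                }
                out.write(json.dumps(record) + "\n")
```

Writing one JSON record per line keeps the 40,000 answers streamable for the downstream frequency analyses described below.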

3.2 Diversity and inclusion analysis

Our analysis of diversity and inclusion in LLMs extends beyond the technical realm to encompass the inclusion of diverse human voices in terms of the affiliated organizations, geographical locations, and gender of the experts recommended by the models. We asked the GPT-3.5-turbo and GPT-4 models: “Who are the main experts in ecological restoration and what are their affiliations and country?” We used the OpenAI ChatGPT API to ask the same question repeatedly and created two comprehensive lists of 100 experts recommended by GPT-3.5-turbo and GPT-4, respectively, selecting the first 100 unique experts recommended by each model. To fill in any missing information about the recommended experts, we manually searched relevant sources such as Google, LinkedIn profiles, and the corresponding organization pages for each suggested expert. We then analysed the diversity and representativeness of the recommended experts’ organization types, gender, and country.
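A sketch of the deduplication step follows; the regex-based name extraction is a deliberate simplification, since the expert lists in this study were compiled and cross-checked manually.

```python
# Sketch of compiling the first 100 unique expert names per model.
# The regex-based name extraction is a simplified stand-in for the
# manual compilation and verification described above.
import json
import re

# Naive pattern for titled two-part names (illustrative heuristic only)
NAME_PATTERN = re.compile(r"\b(?:Dr\.|Prof\.)\s+([A-Z][a-z]+ [A-Z][a-z]+)")

def first_n_unique_experts(answers, n=100):
    """Collect the first n unique expert names mentioned across answers."""
    seen, experts = set(), []
    for answer in answers:
        for name in NAME_PATTERN.findall(answer):
            if name not in seen:
                seen.add(name)
                experts.append(name)
                if len(experts) == n:
                    return experts
    return experts

with open("gpt_answers.jsonl") as f:
    records = [json.loads(line) for line in f]

for model in ["gpt-3.5-turbo", "gpt-4"]:
    # question 1 is the expert-recommendation question (assumption)
    answers = [r["answer"] for r in records
               if r["model"] == model and r["question_id"] == 1]
    print(model, first_n_unique_experts(answers)[:5])
```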

3.3 Data validation analysis

In expert elicitation, having accurate information to inform conservation assessments is of utmost importance. Hence, we analysed how accuracy changes across versions of ChatGPT (i.e., GPT-3.5-turbo and GPT-4) in terms of the recommended experts. We analysed the list of 100 unique experts for each model. We calculated accuracy by checking the validity of each recommended expert in the ChatGPT models’ answers: for each expert, we manually searched relevant sources such as Google, LinkedIn profiles, and the corresponding organization pages to verify whether the recommended expert exists in reality or is hallucinated.
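Once each recommended expert has been manually labelled as verified or hallucinated, the accuracy metric reduces to a simple ratio. A minimal sketch, assuming a hand-curated CSV of labels per model (the file names and column layout are illustrative):

```python
# Sketch of the validation metric: share of recommended experts that
# could be verified as real people. Labels come from manual web checks;
# the CSV layout is an assumption for illustration.
import csv

def validation_accuracy(label_file):
    with open(label_file) as f:
        labels = [row["verified"] == "yes" for row in csv.DictReader(f)]
    return sum(labels) / len(labels)

for model in ["gpt-3.5-turbo", "gpt-4"]:
    acc = validation_accuracy(f"experts_{model}_labels.csv")
    print(f"{model}: {acc:.1%} of recommended experts verified")
```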

3.4 Geographic distribution of information source analysis

To analyse the geographic distribution of information sources, we identified the frequencies of countries mentioned in the 10,000 answers from each model to the first 10 questions, which concern expertise and information sources. We also analysed the distribution of these frequencies and the percentage ratio of countries in each GPT model’s answers across income-based country categories. We mapped the countries listed in the GPT models’ responses to the income categories low, lower-middle, upper-middle, high, and uncategorized, as determined by the World Bank country classifications by income level for 2022–2023 [38].
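The frequency and income-category analysis can be sketched as below. The country list and income mapping are abbreviated placeholders for the full World Bank 2022–2023 classification, and simple substring matching stands in for the more careful country-name normalization such an analysis requires.

```python
# Sketch of the geographic-distribution analysis: count country mentions
# across answers and aggregate them by World Bank income category.
# The mapping shown here is an abbreviated placeholder for the full
# 2022-2023 World Bank classification.
import json
from collections import Counter

INCOME_CATEGORY = {
    "USA": "high", "Canada": "high", "Australia": "high",
    "UK": "high", "France": "high", "Brazil": "upper-middle",
    "India": "lower-middle", "Ethiopia": "low",
    # ... remaining countries mapped per the World Bank classification
}

with open("gpt_answers.jsonl") as f:
    records = [json.loads(line) for line in f]

for model in ["gpt-3.5-turbo", "gpt-4"]:
    # questions 1-10 concern expertise and information sources
    answers = [r["answer"] for r in records
               if r["model"] == model and r["question_id"] <= 10]
    country_counts = Counter()
    for answer in answers:
        for country in INCOME_CATEGORY:
            country_counts[country] += answer.count(country)
    income_counts = Counter()
    for country, n in country_counts.items():
        income_counts[INCOME_CATEGORY[country]] += n
    total = sum(income_counts.values())
    shares = {cat: f"{n / total:.1%}" for cat, n in income_counts.items()}
    print(model, shares)
```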

3.5 Stakeholder representativeness analysis

From each GPT model, across the 10,000 answers to the stakeholder participation questions (the last 10 questions, questions 11–20), we identified the stakeholders’ organizations based on the codebook of organization categories presented in Supplementary Table 2 [37]. We performed a frequency analysis of each organization type in each GPT model to analyse stakeholder representativeness.
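A minimal sketch of this frequency analysis follows; the keyword lists are illustrative stand-ins for the full codebook in Supplementary Table 2.

```python
# Sketch of the stakeholder-representativeness analysis: map mentioned
# organizations onto codebook categories and count them per model.
# The keyword lists are illustrative stand-ins for the full codebook.
import json
from collections import Counter

CODEBOOK = {
    "government agency": ["Environmental Protection Agency",
                          "Forest Service", "National Park Service"],
    "international body": ["United Nations", "IUCN"],
    "NGO": ["The Nature Conservancy", "WWF"],
    "Indigenous group": ["Indigenous"],
    # ... remaining categories per Supplementary Table 2
}

with open("gpt_answers.jsonl") as f:
    records = [json.loads(line) for line in f]

for model in ["gpt-3.5-turbo", "gpt-4"]:
    answers = [r["answer"] for r in records
               if r["model"] == model and 11 <= r["question_id"] <= 20]
    counts = Counter()
    for answer in answers:
        for category, keywords in CODEBOOK.items():
            if any(k in answer for k in keywords):
                counts[category] += 1
    print(model, counts.most_common())
```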

Fig. 2 Comparison of conservation science expert distribution between GPT-3.5 and GPT-4 in terms of experts’ (a) affiliation, (b) gender, (c) factual validation, and (d) location. Notable changes include (a) a decrease in university representation and the inclusion of media sources in the updated GPT-4, (b) an overrepresentation of male experts and an absence of non-binary experts in both GPT models, (c) a significant improvement in validation accuracy in the updated GPT-4 model, and (d) a decrease in, but continued dominance of, experts from the USA, with a broader range of countries included in the updated GPT-4

4 Results

The use of chatbots, such as GPT-3.5-turbo or GPT-4, to support the elicitation of ecological restoration expertise presents several ethical challenges. LLMs are trained on vast datasets, which may contain biased perspectives that skew the knowledge base. From a diversity and inclusion perspective, Fig. 2a shows the affiliated organization categories of the experts recommended by the GPT-3.5-turbo and GPT-4 models. The categories we identified from both models are universities and research institutes, government agencies, private companies and NGOs, and international bodies. While the overall numbers shifted in the updated GPT model (Fig. 2a), universities remain strongly represented, accounting for over 60% of experts in both GPT models. The updated version of ChatGPT reduced university representation by 12.5% and introduced the media (e.g., ecology journalists) as a source of experts. The inclusion of media in GPT-4 indicates increased data diversity in the updated model, as it recognizes the role of non-technical experts; however, the representation of this newly added source of expertise remains very low (4%).

Figure 2b shows the gender of the recommended experts from a diversity and inclusion perspective. Both ChatGPT models rely predominantly on the expertise of male academics (> 45%). Moreover, the absence of non-binary experts in both models raises concerns about inclusivity.

Figure 2d shows the locations of the recommended experts. In GPT-3.5-turbo, the majority of experts are from the USA (73%), followed by Canada (10%) and Australia (7%). GPT-4, by contrast, demonstrates a more diversified representation, with the USA still leading but reduced to 57% of experts. GPT-4 newly includes experts from various European countries (Germany, Ireland, Denmark, Belgium) as well as from Japan, India, and Chile. However, in the updated GPT model, the combined representation of all newly included countries is only 9%, while the USA alone accounts for 57%.

In terms of data validation, our analysis shows that the accuracy of the updated version of ChatGPT improved by 32.8% (Fig. 2c). These findings demonstrate that the new chatbot model improves accuracy and covers an expanded list of expert affiliations and locations, but a significant dominance of male academics and North American experts remains.

From the perspective of the geographic distribution of information sources, the frequency of countries mentioned by the GPT-4 model is 1.5 times that of the prior version. The updated model included 33 new countries that were not mentioned in the responses of the previous version (Fig. 3); in contrast, only 4 countries mentioned by GPT-3.5-turbo were missing from the updated version. Figure 3b shows the frequency distribution of mentioned countries in both GPT models. Just 8% of countries account for 72.2% and 75.5% of all country mentions in the GPT-3.5-turbo and GPT-4 models, respectively. These dominant 8% are high-income countries, such as the USA, Canada, Australia, the UK, and France; the remaining 92% of countries represent on average only 26% of the overall frequency of mentions across both models. Although all income categories of countries saw a significant increase in mentions in the updated ChatGPT model, the relative frequency of mentions remained almost the same (Fig. 4b). High- and upper-middle-income countries were the most mentioned (at least 88% for both models). These results demonstrate that the new chatbot model covers a wider geographic distribution of information sources. However, this expansion does not necessarily change the centrality of high- and upper-middle-income countries or the lack of representation of lower-middle- and low-income countries.

Fig. 3 Comparison of countries mentioned by the GPT-3.5-turbo and GPT-4 models. Although greater inclusivity of different countries is observed in the GPT-4 model, both GPT models rely heavily on narrow expertise from the Global North

Fig. 4 Comparison of countries mentioned when asked about ecological restoration in the GPT-3.5-turbo and GPT-4 models in terms of (a) the frequency of mentioned countries and (b) the relative frequency distribution of countries by income category. Although (a) all income categories of countries saw a significant increase in mentions in the updated ChatGPT model, (b) the relative frequency of mentions remained almost the same, strongly prioritizing high- and upper-middle-income countries while neglecting low- and lower-middle-income countries

In terms of stakeholder representativeness, the number of organizations mentioned by the GPT-4 model doubled compared to the prior version (Table 1). The updated model delivered answers with more detailed content and fuller descriptions of the organizations engaged in restoration actions. Although all types of organizations saw a significant increase in mentions in the updated ChatGPT model, the relative frequency of mentions remained almost the same. Government agencies were the most cited organization type in both models, with mostly North American agencies described, including the Environmental Protection Agency, the Forest Service, and the National Park Service. Together with universities and international bodies, Indigenous groups saw a slight increase in their frequency of representation (2%). These findings demonstrate that the new chatbot model draws on an expanded database with a significantly greater capacity for building complex texts. However, this expansion in the volume of information does not necessarily change the centrality of powerful organizations or improve the representation of overlooked groups.

Table 1 Frequency analysis of organization types in the GPT-3.5-turbo and GPT-4 models

5 Balancing evolving technology and ethics

Rapid chatbot developments have expanded the capacity to integrate datasets and process extensive information, attempting to generate more nuanced and accurate responses [39]. Our analysis shows that the chatbot advancements from the GPT-3.5 to the GPT-4 model have led to more accurate responses. However, our findings reveal a striking disparity: while the updated ChatGPT model significantly enhances accuracy, these changes do not necessarily improve the ethical representation of diverse expert voices and perspectives, information sources, and stakeholder participation.

Ethical debates question how AI-generated information influences the legitimacy of particular sources and types of expertise, and whether it ensures a diverse and inclusive representation of different perspectives [3]. Our analysis shows that ChatGPT still relies heavily on information sources from high-income countries and male academics, while dismissing expertise sourced from Indigenous organizations and low-income countries. Yet debates around just and ethical conservation highlight the importance of including diverse experts with a deep understanding of specific geographical locations and local issues [40, 41]. The lack of representation of diverse sources of expertise in this chatbot’s content reinforces concerns about power dynamics, biases, and inequalities in conservation science [42]. Such imbalances raise concerns about fairness and inclusivity when chatbot tools are incorporated into the process of collecting and translating available sources of expertise. The absence of such experts could result in solutions that are less culturally sensitive, or contribute to the reproduction and perpetuation of stereotypes, misinformation, and a lack of awareness regarding the challenges facing ecological restoration across the world [41]. Ethical conservation can and should cross-fertilise with ethical AI to develop more robust and culturally sensitive conservation strategies on a global scale, based on well-curated large datasets that provide salient information from experts across regions, cultures, and locations around the globe [40, 41]. Ethical consideration of AI tools for expert elicitation also reveals how chatbots may lack contextual understanding of conservation issues, including interpretations, predictions, and representations of place-based experiences.

This study calls for a balanced exploration of generative AI applications to support efforts to draw on available expertise in conservation science. This requires chatbots not only to be accurate but also to consider the inclusion and representation of diverse perspectives and voices. In navigating these ethical considerations, human-centred interventions become imperative, focusing on shared rights and responsibilities between users and developers. On the developers’ side, ongoing scrutiny, validation, and refinement of LLM-based tools can improve these AI systems [43]; the current limitations in data coverage, distributional representation, and the dominance of geographically concentrated online content in training datasets should be mitigated by developers as their share of the responsibility. Here, we also elevate the critical importance of users’ responsibility for addressing ethical concerns and fostering a more holistic and equitable approach to expert elicitation in conservation science. Users have a responsibility to question the use and application of this information, particularly regarding potential harms and impacts in specific contexts [43]. In this sense, users can adjust and fine-tune prompts by modifying the wording, adding context, or selecting prompts that better reflect their interests, perspectives, feedback, or guidance; user-guided tuning and automatic prompt engineering, for instance, are user-led interventions designed to enhance the performance of AI chatbot models for specific demands. Additionally, users must exercise caution and precision in their requests, clearly defining ethical representation and considering its multidimensional aspects. The development and use of these generative AI tools require critical consideration of how automated content generation can be reflexive in conversations, enabling dialogues that necessitate critical debate and analytical interpretation. Implementing transparent and inclusive methodologies by developers and users [44], actively seeking out and incorporating various voices, will help mitigate the risk of unintentional oversight when chatbot-generated information is applied in expert elicitation processes [45]. Relying too heavily on chatbots could limit the involvement of actual experts and knowledge co-production, potentially ignoring the nuanced and dynamic aspects of ecological environments that computational models might overlook. Therefore, a thoughtful integration of chatbots that respects ethical norms and complements human expertise is crucial for their responsible application in expert elicitation in conservation research and practice.
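As an illustration of such user-led prompt adjustment, the sketch below reformulates our expert-recommendation question to explicitly request diverse perspectives. The added wording is hypothetical, and whether it actually reduces bias in the returned answers would itself need empirical evaluation.

```python
# Sketch of a user-led prompt adjustment intended to elicit a broader
# range of perspectives. The added wording is hypothetical; its actual
# effect on model bias would need empirical evaluation.
from openai import OpenAI

client = OpenAI()

BASE_QUESTION = ("Who are the main experts in ecological restoration "
                 "and what are their affiliations and country?")

INCLUSIVE_SUFFIX = (
    " Please include experts of different genders and from a broad range "
    "of regions, including low- and lower-middle-income countries, and "
    "note any Indigenous or community-based expertise."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": BASE_QUESTION + INCLUSIVE_SUFFIX}],
)
print(response.choices[0].message.content)
```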

The horizon of generative AI is changing rapidly, demanding further research on the creation and adoption of the latest LLMs in expert elicitation to ensure diverse and balanced training sets that embrace core ethical considerations. Future research can focus on the application in conservation science of the newly announced custom GPTs, which allow users to create customized versions of ChatGPT [46]. Users can upload knowledge files directly to the GPT builder for the model to consider when generating responses, so unique custom GPTs can be developed for specific expert elicitation use cases. Similarly, expert elicitation that requires knowledge of the existing research literature can adopt ResearchGPT, developed by Consensus [47], which enables users to get responses, find papers, and prepare documentation based on scientific research by searching a database of more than 200 million papers within the ChatGPT interface; its ethical use requires further scrutiny [48, 49]. In the ethics of chatbots, a potential research direction involves investigating alternative modes of formulating inclusive prompts when querying LLMs, to reveal nuances and a multiplicity of perspectives. By explicitly requesting a broad range of perspectives across factors such as geography, gender, race, ethnicity, and organizational background, future research can explore the capability of chatbots to generate responses that go beyond existing biases in training data. Moreover, extensive research is required to design, collect, and analyse data on the actual distribution of ecological experts across geographical regions, for comparison with the ChatGPT-recommended distribution. Other required areas of research include expanding the methods by which chatbot datasets are assembled and produced, focusing on representation not only in terms of data volume but also on the diversity of contexts and realities associated with the questions and content. These issues are critical for recognizing and analysing biases in generative AI models, specifically through systematic examination and validation against biases observed in non-generative AI approaches. As LLM-driven advancements expand, these considerations and debates are essential to ensure ethical inquiry into evolving generative AI products and services, with the capability to recognize and embrace multiple implications and perspectives across varied sectors [48, 49].