1 Introduction

In recent years, large language models (LLMs), represented by OpenAI’s ChatGPT (OpenAI 2022), have emerged as a core research focus of natural language processing (NLP) (Zhao et al. 2023). Characterized by their significant size, often containing tens or hundreds of billions of parameters, these models have approached, and in some cases surpassed, human-level performance in language comprehension, knowledge understanding, and many other capabilities (Kiela et al. 2021). With their ability to generate human-like text, LLMs have ignited broad interest in the larger pursuit of artificial general intelligence (Bubeck et al. 2023). Researchers and practitioners can use pre-trained LLMs as foundation models, which can be fine-tuned to meet specific requirements (Bommasani et al. 2021). Inspired by models such as T5 (Raffel et al. 2020) and GPT-3 (Brown et al. 2020), language tasks can be cast into a text-to-text format by selecting appropriate prompts. Even without fine-tuning, these foundation models are increasingly capable of few- or zero-shot learning in a wide range of scenarios (Liu et al. 2023). Predominantly, these LLMs adopt Transformer-based architectures, which have become the gold standard in NLP due to their ability to handle sequential data through self-attention mechanisms (Vaswani et al. 2017). A defining attribute of such models is their autoregressive nature: they predict the next token conditioned on all previous tokens, resulting in highly coherent and contextually relevant outputs (Radford et al. 2019).

Fig. 1

An example of bias and hallucination. Bias information is highlighted in Red, and hallucination information is highlighted in Blue

Despite this remarkable success, nearly all existing LLMs face two primary challenges. Typically trained on vast amounts of data sourced from diverse online corpora, LLMs inherit toxic, offensive, misleading, stereotypical, and other harmful or discriminatory behaviors (Bolukbasi et al. 2016; Caliskan et al. 2017; Blodgett et al. 2020; Bender et al. 2021). These are manifestations of what is commonly referred to as bias. Moreover, as exemplified in Fig. 1, LLMs often deviate from the truth when responding to user input, generating content that appears fluent and accurate but is, in reality, fabricated or baseless (Longpre et al. 2021; Adlakha et al. 2023; Ji et al. 2023). This phenomenon, commonly referred to as hallucination, is the other primary challenge; it severely undermines the credibility and reliability of LLMs and makes them difficult to apply in many professional decision-making contexts (Kaddour et al. 2023; Rawte et al. 2023). Beyond these two issues, there are many other topics related to the trustworthiness of LLMs, as discussed in Sect. 6.3. However, this paper focuses on bias and hallucination because these concepts encompass a wide range of specific issues and play a key role in determining the trustworthiness of LLMs. First, bias and hallucination are broad concepts that represent some of the most common issues encountered when developing general-purpose LLMs; hallucinations in particular occur frequently, and are arguably unavoidable, in all mainstream conversational LLMs such as ChatGPT (OpenAI 2022). Bias covers not only common issues such as racial discrimination and gender bias but also other manifestations such as political and regional biases. The problem of hallucination is not limited to specific task domains; it also concerns general content generation, especially when LLMs are used for content retrieval. Choosing these two topics therefore allows us to cover multiple aspects of the trustworthiness of LLMs, rather than being confined to a specific domain. While LLMs exhibit other capability-related issues, such as failing to answer questions within their competence or struggling to satisfy word-count requirements, these issues primarily arise from underlying training strategies and fundamental model architecture (Gao et al. 2023) and are beyond the scope of this paper.

In relation to the impact of LLMs on scientific research, Van Dis et al. (2023) highlight five key issues in Nature: holding on to human verification, developing rules for accountability, investing in truly open LLMs, embracing the benefits of artificial intelligence (AI), and widening the debate. As the applications and ubiquity of LLMs continue to grow, so does the imperative to address these challenges head-on. Addressing the biases and hallucinations in LLMs is not just about improving model accuracy; it is about ensuring that AI technologies are used ethically, responsibly, and in ways that promote societal good. With these growing concerns, rigorous auditing of LLM-generated content becomes paramount. Auditing serves as a crucial governance mechanism, designed to identify and mitigate potential risks in AI systems (Mökander et al. 2023). Generally, auditing can be categorized into three high-level domains: governance auditing, model auditing, and application auditing (Mökander et al. 2023). The focus of this paper is the auditing of model-generated content, which falls under model auditing at the technical architecture level and aims to ensure that the content output by LLMs is accurate, fair, and unbiased.

Currently, many studies analyze and address the issues associated with content generation by LLMs. Most studies concerning bias focus on models’ tendencies to manifest prejudices in certain areas, such as racial discrimination and political bias. Studies on hallucination are typically specific to tasks such as question answering (QA) or table-to-text generation, although general-purpose generation hallucinations are now receiving increasing attention.

1.1 Objectives and limitations of recent reviews

Table 1 Recent reviews on bias or hallucination in language models

Several reviews have been conducted recently on bias and hallucination in language models, as shown in Table 1; the majority are from 2023. A recent review (Gallegos et al. 2023) formally defines bias and provides a comprehensive taxonomy for evaluating and mitigating bias in LLMs, although it omits some currently mainstream and effective methodologies, such as reinforcement learning from human feedback (RLHF). Li et al. (2023) begins with model size to discuss fairness issues, distinguishing between medium-sized and large-sized LLMs; despite this differentiation, however, the discussion is somewhat repetitive. Two review papers (Ranaldi et al. 2023; Meade et al. 2022) conduct experiments to assess the performance of different debiasing methods, but both lack a systematic summary of debiasing and evaluation methods. Another review approaches the topic from the perspective of fairness in multilingual and non-English contexts, providing a concise overview, but it does not include an analytical treatment of bias evaluation and debiasing techniques. In 2020, a critical review (Blodgett et al. 2020) summarized the prevailing misunderstandings in the NLP field’s approach to bias, and many subsequent papers on defining bias (Gallegos et al. 2023; Li et al. 2023; Meade et al. 2022) have adopted its recommendations. A review paper (Zhang et al. 2023) presents taxonomies of LLM hallucination phenomena and evaluation benchmarks, but many mitigation methods are not properly categorized within the proposed taxonomy. Huang et al. (2023) offers a well-organized classification of the latest research on hallucination in LLMs; however, the overly detailed categorization may obscure the interrelation between methods. Ji et al. (2023) explores the hallucination issue by focusing on specific NLP tasks, yet limits the discussion of evaluation and mitigation methods to these tasks without addressing their broader applicability. A concise review (Rawte et al. 2023) discusses hallucination issues across different modalities from the perspective of large foundation models; while the coverage is broad, each section lacks depth and mentions only a few studies, highlighting the need for more comprehensive and detailed exploration in future research.

To the best of our knowledge, no work has yet provided a comprehensive review of these two primary concerns in content generation by LLMs. This paper therefore presents an in-depth study of content auditing for LLMs, focusing on the critical issues of bias and hallucination. The challenges, existing evaluation methods, and potential solutions are explored to ensure that the capabilities of LLMs are harnessed responsibly.

The rest of this paper is organized as follows: Sect. 2 illustrates the review methodology and presents the organization of this paper, as also depicted in Fig. 2. Next, issues of bias and hallucination are explored in Sect. 3 and Sect. 4, respectively. These two sections begin with examples of these problems, and discuss their causes. Subsequently, relevant evaluation methods and metrics are introduced, followed by a presentation of recent work towards mitigating the problems. A comparative analysis of bias and hallucination in LLMs is also presented in Sect. 5. Finally, Sect. 6 summarizes the current research trends and provides potential future research directions.

Fig. 2

A guide that lays out the sections of the review paper

2 Review methodology

This paper searches several electronic archives for related articles, including Web of Science (https://www.webofscience.com/wos/alldb/basic-search), Google Scholar (https://scholar.google.com), ACL Anthology (https://aclanthology.org), AAAI Digital Library (https://www.proceedings.aaai.org/Library/library.php), IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp) and Springer Link (https://link.springer.com). Relevant keywords such as “Bias”, “Hallucination”, “Large Language Models”, “Evaluation”, “Benchmark”, and “Mitigation” are cross-combined to find publications related to this work. The collected papers are filtered based on their titles and abstracts, and some of them lead to the incorporation of additional references through their citations. Figure 3 displays the number of papers published between March 2022 and October 2023 that focus on bias and hallucination in LLMs. The figure illustrates a significant surge in research on these topics following the launch of ChatGPT in November 2022, indicating heightened academic interest in addressing these issues within LLMs. This review methodology ensures a holistic understanding of the subject, incorporating not only the most recent research but also key foundational papers that set the stage for recent advancements.

As illustrated in Fig. 2, this article categorizes the two main concerns regarding LLM-generated content into separate sections. Readers may selectively navigate through the content based on their interests.

Fig. 3

The number of papers on bias and hallucination between Mar 2022 and Oct 2023

3 Debiasing in LLMs

This section starts by introducing the concept of “debiasing” and discussing its development in Sect. 3.1. Next, Sect. 3.2 presents the commonly used datasets and evaluation metrics related to debiasing. Finally, Sect. 3.3 explores the various strategies and methods for debiasing.

3.1 Manifesting of bias

This section begins with a definition of the debiasing problem. Subsequently, an overview of the research history related to bias is provided. Finally, a taxonomy of various bias categories is explored.

Debiasing is commonly defined as the process of detecting, mitigating, or eliminating biases, especially in NLP and machine learning, ensuring that models and algorithms neither inherit nor propagate unequal, unfair, or unsuitable information (Barocas et al. 2019).

While debiasing is an emerging area of interest, its study has a deep-rooted history. Bias is not a recent issue; it has been intertwined with human civilization for ages (Ntoutsi et al. 2020). Ethical concerns about AI surfaced almost as soon as the idea of AI emerged (Wiener 1950; Largeault 1978; Josef 1976). Starting in the early 21st century, the discourse on bias in machine learning has amplified, making researchers increasingly vigilant about the prevalence of bias across daily tasks (Leavy 2018; Dastin 2022; Sweeney 2013; Ludwig 2015; Angwin et al. 2022; Wang and Kosinski 2018; Buolamwini and Gebru 2018; Luong et al. 2011; Calders et al. 2009; Kamiran and Calders 2009). It was only in 2015 that the NLP community formally acknowledged bias in word embeddings (Schmidt 2015). Between 2016 and 2017, three pivotal papers brought the debiasing challenge to the forefront (Bolukbasi et al. 2016; Caliskan et al. 2017; Zhao et al. 2017).

Currently, debiasing research often targets specific types of bias. This can be broadly categorized into three primary categories:

  1. Racial and religious biases: This category includes biases based on race, ethnicity, or religion. Some studies (Caliskan et al. 2017; Greenwald et al. 1998) have found that names associated with European Americans are more likely to be linked with pleasantness, while non-European names tend to be associated with unpleasantness.

  2. Gender and orientation biases: Gender and orientation biases are foundational to debiasing research and remain the predominant area of investigation in this field. Models often exhibit certain inherent stereotypes tied to gender roles (Caliskan et al. 2017; Bolukbasi et al. 2016). For instance, some models might associate cooking more closely with women (Zhao et al. 2017) or correlate “CEO” with men (Hendricks et al. 2018). Such linguistic practices are often tied to power hierarchies. Debiasing work in this area strives to prevent such stereotypical associations.

  3. Political and cultural biases: Language models may also reflect biases in political or cultural contexts, often replicating the dominant ideologies or cultural attitudes present in their training data. The study of political and cultural biases is a relatively nascent area of focus. Additionally, evidence suggests that BERT models (Devlin et al. 2019) are perceived to exhibit a higher degree of social sensitivity than GPT models (Liu et al. 2022; Feng et al. 2023). Debiasing efforts in this area attempt to establish a balance that prevents favoring one ideology or culture over others.

3.2 Bias evaluation

In this section, a taxonomy of metrics and methods for evaluating bias in language models is presented. Although many evaluation metrics measuring a specific type of bias depend on the dataset used, for clarity this section treats evaluation metrics and evaluation benchmarks separately. Evaluation metrics are categorized according to the evaluation methods they are associated with, which allows for a more coherent classification than organizing them around datasets of various formats. This delineation ensures a structured understanding of how different metrics are applied and the contexts in which they are most effective.

3.2.1 Evaluation metrics and methods

In some previous work, bias measures are categorized into intrinsic and extrinsic measures (Delobelle et al. 2022; Ramesh et al. 2023). Intrinsic metrics measure bias existing in pre-trained LLMs, while extrinsic metrics measure bias arising during fine-tuning for specific downstream tasks. However, we note that this categorization does not neatly classify existing bias evaluation metrics, as there is considerable overlap. Prior to the prevalence of general-purpose generative LLMs, bias was typically studied in task-specific settings such as text classification or QA. Furthermore, both evaluation and debiasing methods have historically targeted specific types of bias, such as gender bias. The seminal work on evaluating biases in language models (Bolukbasi et al. 2016) introduces metrics based on word embeddings. Such embedding-based metrics were widely applied in early bias evaluation and inspired a multitude of subsequent refinements (Caliskan et al. 2017; May et al. 2019; Guo and Caliskan 2021; Dolci et al. 2023). With the rise of masked language model (MLM) pre-training, an increasing number of approaches began incorporating the concept of masked tokens into bias evaluation. Additionally, observing a model’s generative responses to varying inputs has been a longstanding and common method of bias evaluation. Accordingly, this section presents a taxonomy of three classes of evaluation methods, illustrated in Fig. 4. Each class is examined in the subsequent discussions.

Fig. 4

Examples of three classes of bias evaluation methods

Embedding-based Evaluation The Word Embedding Association Test (WEAT) (Caliskan et al. 2017) is a foundational method that measures bias at the word level by examining the similarity of static word embeddings. Building upon WEAT, the Sentence Encoder Association Test (SEAT) (May et al. 2019) extends the analysis to the sentence level. SEAT evaluates bias by employing hand-crafted templates filled with the terms under evaluation. These templates are designed to convey minimal meaning beyond the inserted terms, such as “This is <word>.” or “<word> is here.”. Subsequently, an encoder such as BERT (Devlin et al. 2019) encodes these sentences, and the representation of the special token “[CLS]” is extracted to serve as the target concept embedding. Furthermore, the Contextualized Embedding Association Test (CEAT) (Guo and Caliskan 2021) uses Reddit data as context templates, extending WEAT to contextualized embeddings. There are also other WEAT extensions (Tan and Celis 2019; Lauscher et al. 2021; Dolci et al. 2023). These metrics essentially compute cosine similarities between target concept embeddings and neutral attribute embeddings. The difference in mean similarity between the target concept and the two sets of neutral attributes is then calculated, given by:

$$\begin{aligned} s(w, A, B) = \text {mean}_{a\in A} \cos (w, a) - \text {mean}_{b\in B} \cos (w, b), \end{aligned}$$
(1)

where A and B are sets of neutral attribute embeddings, and w is the target concept embedding. Finally, bias is measured by computing the effect size, given by:

$$\begin{aligned} f(W_1, W_2, A, B) = \frac{\text {mean}_{w_1\in W_1}s(w_1, A, B)-\text {mean}_{w_2\in W_2}s(w_2, A, B)}{\text {std}_{w\in W_1\cup W_2}s(w, A, B)}, \end{aligned}$$
(2)

where \(W_1\) and \(W_2\) are two sets of target concept embeddings. A larger effect size indicates stronger bias within the LLMs.
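
To make the computation concrete, the following is a minimal sketch of the WEAT-style effect size in Eqs. (1)–(2). It assumes a dictionary `emb` mapping words to pre-trained static embeddings (e.g., GloVe), and the word lists shown are purely illustrative.

```python
# Minimal WEAT-style bias measurement sketch (Eqs. 1-2), assuming `emb` maps a
# word to a NumPy vector (e.g., loaded from GloVe/word2vec).
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def s(w, A, B, emb):
    # Eq. (1): difference of mean similarities to the two attribute sets
    return np.mean([cos(emb[w], emb[a]) for a in A]) - \
           np.mean([cos(emb[w], emb[b]) for b in B])

def effect_size(W1, W2, A, B, emb):
    # Eq. (2): normalized difference between the two target concept sets
    s1 = [s(w, A, B, emb) for w in W1]
    s2 = [s(w, A, B, emb) for w in W2]
    return (np.mean(s1) - np.mean(s2)) / np.std(s1 + s2)

# Example with hypothetical word lists:
# W1 = ["engineer", "scientist"]; W2 = ["nurse", "teacher"]
# A = ["he", "man"]; B = ["she", "woman"]
# print(effect_size(W1, W2, A, B, emb))
```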

MLM-based Evaluation MLM-based methods specifically refer to approaches that utilize the idea of the masked language model (MLM) (Devlin et al. 2019) to evaluate bias by measuring the probability distribution of the model’s outputs at the “[MASK]” position. Discovery of Correlations (DisCo) (Webster et al. 2020) uses templates with two slots (e.g., “<word> likes [MASK]” or “<word> is [MASK]”). The “<word>” slot is filled with potentially biased words (e.g., gendered names or profession names). The “[MASK]” slot is then predicted by the language model under evaluation, retaining the top three predictions or predictions with \(P([\text {MASK}]\vert T;\theta )>0.1\) (Lauscher et al. 2021). The measurement score is derived by averaging the count of differing predictions across all templates, based on the premise that an unbiased model should exhibit similar probability distributions for the same template filled with different word sets. Log Probability Bias Score (LPBS) (Kurita et al. 2019) uses templates similar to DisCo but differs in its scoring approach: it adopts a more probabilistic calculation and normalizes the model’s output probabilities at the “[MASK]” position using prior probabilities. Specifically, for a template like “[MASK] likes <word>”, they construct “[MASK] likes [MASK]”. This corrects for the model’s prior probability bias towards different target concept words, with the formulaic representation being:

$$\begin{aligned} \text {LPBS}(S)=\log \frac{p([\text {MASK}]\vert T_i;\theta )}{p([\text {MASK}]\vert T_{i(prior)};\theta )}-\log \frac{p([\text {MASK}]\vert T_j;\theta )}{p([\text {MASK}]\vert T_{j(prior)};\theta )}. \end{aligned}$$
(3)
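
The following is a minimal, hedged sketch of an LPBS-style computation (Eq. 3) using a publicly available masked language model via HuggingFace transformers; the model choice, template, and target words are illustrative assumptions rather than the exact setup of Kurita et al. (2019).

```python
# LPBS-style sketch (Eq. 3): compare normalized [MASK] probabilities of "he"
# and "she" for a shared attribute template. Model and template are assumed.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-uncased"  # assumed model choice for illustration
tok = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

def mask_prob(sentence, target_word):
    """Probability of `target_word` at the first [MASK] position."""
    inputs = tok(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    return probs[tok.convert_tokens_to_ids(target_word)].item()

template = f"{tok.mask_token} is a doctor."            # T_i / T_j share this template
prior = f"{tok.mask_token} is a {tok.mask_token}."     # prior template (attribute masked)

lpbs = (math.log(mask_prob(template, "he") / mask_prob(prior, "he"))
        - math.log(mask_prob(template, "she") / mask_prob(prior, "she")))
print(f"LPBS: {lpbs:.4f}  (values far from 0 suggest bias)")
```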

Some methods use the pseudo-log-likelihood MLM score (Salazar et al. 2020) to calculate a perplexity-based metric over all tokens in a sentence conditioned on the stereotypical tokens. In the CrowS-Pairs Score (CPS) (Nangia et al. 2020), each sample consists of a pair of sentences, one of which is modified to contain either a stereotype or an anti-stereotype. For these sentence pairs, the authors measure the degree of stereotyping by calculating the probability of the unmodified tokens given the modified set, denoted as \(P(U\vert M;\theta )\). To approximate this probability, they mask one token from the unmodified set at a time until all unmodified tokens have been masked. The score is then computed using the following formula:

$$\begin{aligned} \text {CPS}(S)=\sum _{i=1}^{|S|}\log P(u_i\in U\vert U_{\setminus {u_i}}, M;\theta ). \end{aligned}$$
(4)

Then the bias score is computed as:

$$\begin{aligned} \text {Bias-Score}(S)=\frac{1}{N}\sum _{(S^{st},S^{at})}\mathbb {I}(\text {CPS}(S^{st})>\text {CPS}(S^{at})), \end{aligned}$$
(5)

where \(\mathbb {I}\) is the indicator function, which returns 1 if its argument is True and 0 otherwise, and \(S^{st}\) and \(S^{at}\) are stereotypical and anti-stereotypical sentences, respectively. The ideal score for this metric is 0.5 (Nangia et al. 2020). Similar to the CrowS-Pairs Score, Context Association Test (CAT) (Nadeem et al. 2021) also compares sentence pairs, but in contrast to the pseudo-log-likelihood MLM score, CAT calculates \(P(M\vert U;\theta )\) rather than \(P(U\vert M;\theta )\). While the CrowS-Pairs Score and CAT only consider predicting a single masked word, All Unmasked Likelihood (AUL) (Kaneko and Bollegala 2022) predicts all tokens in a sentence given the MLM embedding of the unmasked input:

$$\begin{aligned} \text {AUL}(S)=\frac{1}{\vert S\vert }\sum ^{\vert S\vert }_{i=1}\log P(m_i\vert S;\theta ). \end{aligned}$$
(6)
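
Once a per-sentence MLM score such as CPS (Eq. 4) or AUL (Eq. 6) is available, the pair-wise Bias-Score of Eq. (5) reduces to a simple indicator average, as the following sketch (with an assumed `score` function and hypothetical sentence pairs) illustrates.

```python
# Pair-wise Bias-Score aggregation sketch (Eq. 5); `score` is any per-sentence
# MLM score such as CPS or AUL.
def bias_score(pairs, score):
    """pairs: list of (stereotypical_sentence, anti_stereotypical_sentence)."""
    hits = sum(1 for s_st, s_at in pairs if score(s_st) > score(s_at))
    return hits / len(pairs)

# Example with hypothetical data; 0.5 is the ideal (unbiased) value.
# pairs = [("Women are bad at math.", "Men are bad at math."), ...]
# print(bias_score(pairs, score=cps))
```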

Generation-based evaluation Increasingly, research is turning its attention to bias issues in closed-source LLMs, such as ChatGPT (OpenAI 2022). Evaluating these black-box models presents unique challenges, as embedding-based and MLM-based methods are not applicable due to restricted access to their internal mechanisms. As a result, evaluation must rely solely on analyzing the generations of these models. The most straightforward method for evaluating bias in generated text is to use an additional model specifically designed to score the text for bias-related aspects. Alnegheimish et al. (2022) use natural sentences extracted from real-world texts, such as Wikipedia, as prompts. These sentences cover a range of professions. The model under evaluation is tasked with generating continuations of these prompts, and by analyzing the continuations researchers can assess the model’s gender and occupational biases. Additionally, some black-box commercial APIs are available for different languages; by sending LLM-generated content to these APIs, it is possible to detect and mitigate toxic or sensitive information in the output. Incorporating concepts from other NLP tasks, such as natural language inference (NLI), into bias evaluation is also a common approach. Dev et al. (2020) propose a bias evaluation method based on the expectation that an unbiased model would predict a “neutral” outcome for premise-hypothesis pairs such as “The nurse is playing tennis. The woman is playing tennis.”. Conversely, a biased model might predict either “entailment” or “contradiction” for these pairs. However, such evaluation methods often involve fine-tuning the model under evaluation (Dev et al. 2020; Wald and Pfahler 2023) or the use of traditional sentiment analysis tools such as VADER (Hutto and Gilbert 2014). These approaches are not well suited to the current scenario, as they cannot preserve the original model parameters, and the training process involved might further exacerbate biases and lead to erroneous evaluations. Currently, the most practical generation-based methods involve prompting LLMs to continue specifically designed text and then evaluating the degree of bias based on the content of these continuations (Bordia and Bowman 2019; Bommasani et al. 2023).
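
As a concrete illustration of generation-based evaluation, the sketch below prompts a small open-source causal LM with profession-related templates for two demographic groups and compares the sentiment of its continuations using an off-the-shelf classifier; the model names and the use of sentiment as a bias proxy are assumptions for illustration only.

```python
# Generation-based bias probe sketch: compare continuation sentiment between
# two demographic prompt sets. Models and prompts are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
scorer = pipeline("sentiment-analysis")  # default English sentiment model

def group_score(prompts, n_tokens=30):
    scores = []
    for p in prompts:
        out = generator(p, max_new_tokens=n_tokens, do_sample=True,
                        num_return_sequences=1)[0]["generated_text"]
        continuation = out[len(p):]
        result = scorer(continuation)[0]
        signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
        scores.append(signed)
    return sum(scores) / len(scores)

prompts_a = ["The man worked as a", "He was known at work for being"]
prompts_b = ["The woman worked as a", "She was known at work for being"]
print("sentiment gap:", group_score(prompts_a) - group_score(prompts_b))
```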

3.2.2 Evaluation benchmarks

In addition to the well-known datasets SEAT (May et al. 2019), CrowS-Pairs (Nangia et al. 2020), and StereoSet (Nadeem et al. 2021), there are several specialized or recently established datasets. The most widely used and newly proposed benchmarks are presented in Table 2. It can be observed that only a few methods (De-Arteaga et al. 2019; Zhao et al. 2020) employ embedding-based evaluation; the majority are MLM-based. With the increasing presence of black-box commercial LLMs, neither embedding-based nor MLM-based evaluations can be applied to these models. Thus, generation-based methods have been used more frequently in recent research (Zhao et al. 2023; Krieg et al. 2023), and benchmarks such as RealToxicityPrompts (Gehman et al. 2020) have become the main approach for industry to measure model fairness. Most benchmarks include gender bias, while fewer include cultural bias. Some benchmarks also feature specific, nuanced types of bias: CHBias (Zhao et al. 2023) and CrowS-Pairs (Nangia et al. 2020) include biases related to an individual’s appearance and age, and HolisticBias (Smith et al. 2022) includes biases towards an individual’s abilities. Some representative works are presented in this section.

Table 2 Widely used benchmarks for evaluating bias in LLMs

CHBias (Zhao et al. 2023) stands out as a unique dataset that focuses on addressing bias in the Chinese language. Unlike most other datasets primarily focusing on English bias, CHBias collects data from Weibo, one of China’s largest social media platforms.

WinoBias (Zhao et al. 2018), a dataset specifically designed to probe gender bias, features Winograd-schema style sentences that reference individuals through their occupations, such as nurses, doctors, and carpenters. While the official evaluation framework released with the dataset does not adopt an MLM-based evaluation method, instead resembling a downstream task setup, many subsequent studies (Vanmassenhove et al. 2021; Sakaguchi et al. 2021; Felkner et al. 2023) based on this dataset have utilized MLM-based evaluation techniques. The same trend is observed in research using other Winograd-schema datasets and the GAP dataset (Webster et al. 2018).

RedditBias (Barikeri et al. 2021) is dedicated to the study of bias within dialogues and comprises real conversations sourced from Reddit, a platform known for its diverse user interactions.

WinoQueer (Felkner et al. 2023) takes a focused approach to examine biases related to sexual orientation. Developed through a community-in-the-loop method, it aims to assess whether LLMs encode biases harmful to the LGBTQ+ community.

BIGNEWS (Liu et al. 2022) is tailored for the analysis of political bias. It draws its pre-training datasets from online news articles with diverse ideological leanings and language usage, covering 11 media outlets with varying political stances from far-left to far-right. The cleaned dataset, known as BIGNEWS, includes a vast collection of 3,689,229 US political news articles.

3.3 Debiasing methods

In this section, we present methods designed to debias. We classify debiasing methods into three categories: pre-processing methods for data, in-processing methods during training, and post-processing methods for models, which are illustrated in Fig. 5. We will discuss each category in Sects. 3.3.1, 3.3.2, and 3.3.3, respectively.

Fig. 5

Illustration of our taxonomy of mitigation states for debiasing

3.3.1 Pre-processing for data

Pre-processing methods aim to reduce bias in data, which is crucial since, under fixed model parameters, training data exerts the most significant impact on model performance. Many bias issues reflect the characteristics of the training data (Schramowski et al. 2022).

On this premise, biases in pre-trained language models largely arise from imbalances in their training data. A direct approach to countering these biases involves rebalancing the training data. Counterfactual data augmentation (CDA) (Zhao et al. 2018) is a primary method for data rebalancing and is widely used (Zmigrod et al. 2019; Webster et al. 2020; Barikeri et al. 2021). To mitigate gender bias between male and female demographic groups, it is essential to ensure that gender-neutral terms exhibit consistent relationships with gender-specific terms. Take the sentence “He is a doctor” as an example. By employing the CDA method, the gender-specific term “He” can be replaced with “She”, producing an additional training sentence, “She is a doctor.” (Lu et al. 2020). This ensures that both gender groups are equally associated with the gender-neutral term “doctor”. Alternatively, Dixon et al. (2018) do not inject biased examples into the data; instead, they add non-toxic examples until a balanced distribution of toxic and non-toxic examples is achieved across different groups. Different from data rebalancing, counterfactual data substitution (Maudslay et al. 2019) probabilistically substitutes gendered words with counterfactual alternatives without changing the number of examples.
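
The following is a minimal sketch of the CDA idea: each training sentence is paired with a counterfactual copy in which gendered words are swapped. The swap list is a tiny illustrative subset; practical implementations use curated word-pair lists and handle names, casing, and grammar more carefully.

```python
# Minimal counterfactual data augmentation (CDA) sketch with an illustrative
# swap list; real pipelines use much larger curated gendered-word pairs.
import re

SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "man": "woman", "woman": "man"}

def counterfactual(sentence):
    """Swap gendered words to produce a counterfactual sentence."""
    def swap(match):
        word = match.group(0)
        repl = SWAPS.get(word.lower(), word)
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\b\w+\b", swap, sentence)

corpus = ["He is a doctor.", "She works as a nurse."]
augmented = corpus + [counterfactual(s) for s in corpus]
# -> ["He is a doctor.", "She works as a nurse.",
#     "She is a doctor.", "He works as a nurse."]
print(augmented)
```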

The CDA method is currently gaining significant attention, and innovative enhancements are under exploration. Semantic perturbation through controlled text generation is also a widely used approach to mitigating dataset biases (Gardner et al. 2021). It modifies sentences to match certain target attributes, such as verb tense or sentiment. By adjusting the text, it disrupts entrenched biases, preventing models from depending on superficial correlations.

Some methods employ strategies to directly remove biased examples from the training data (Le Bras et al. 2020; Swayamdipta et al. 2020; Oren et al. 2020). While this method can be effective for highly biased datasets, it is somewhat unsatisfying to remove entire data examples due to bias present in just a single feature (Gardner et al. 2021). Some approaches  (Asai and Hajishirzi 2020; Li et al. 2020; Wu et al. 2021; Ross et al. 2021; Bitton et al. 2021; Madaan et al. 2021; Geva et al. 2022; Ross et al. 2022) focus on automatic generation of counterfactual data or contrast sets, with the goal of mitigating systematic oversights. Concurrently, some methods (Webster et al. 2020; Ribeiro et al. 2020; Asai and Hajishirzi 2020; Dua et al. 2021) leverage rule-based or heuristic methods to disrupt sentences, aiming to bolster robustness. Other approaches (Paranjape et al. 2022; Dixit et al. 2022) employ retrieval models to incorporate external knowledge. Additionally, several methods (Li et al. 2023; Xie and Lukasiewicz 2023) explore the integration of CDA with fine-tuning, prompt tuning, and adapter tuning techniques.

3.3.2 In-processing during training

In-processing methods are employed to debias during the training process. When the source of bias is rooted in the training data, it becomes crucial to ensure that model does not absorb or exaggerate these biases. Generally, in-processing methods can be categorized into three primary strategies: incorporating regularization terms, constraining the output of the model, and introducing additional loss functions.

Incorporating regularization terms typically means introducing perturbations during training to prevent the model from internalizing inappropriate information. One notable technique in this regard is dropout. As Webster et al. (2020) suggest, dropout interferes with the model’s training, compelling it to focus on essential information and preventing it from learning irrelevant associations. This method yields notable improvements and highlights the value of dropout as a regularization strategy.

Furthermore, adversarial training has emerged as another form of regularization (Li et al. 2018; Zhang et al. 2018; Elazar and Goldberg 2018). Li et al. (2018) explicitly use adversarial learning to shield personal information, designating the protected information as the target of a discriminator trained alongside the primary training objective. However, Elazar and Goldberg (2018) point out that even after such training, traces of the protected information can still linger in the learned representations. At their core, these adversarial models optimize the predictor’s ability to predict the main variable of interest while simultaneously leading the adversary astray in predicting the protected attribute. It is imperative to recognize that, while effective, adversarial learning can be unstable. It is particularly apt when gender is treated as a protected attribute rather than as a variable of primary concern. Recently, some methods (Li et al. 2023) have employed contrastive learning to further prevent biased generation; such methods are more stable.
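
A common way to realize such adversarial debiasing is a gradient reversal layer, sketched below in PyTorch under assumed dimensions: the adversary learns to predict the protected attribute from the encoded representation, while the reversed gradient pushes the encoder to hide that attribute.

```python
# Adversarial debiasing sketch with a gradient reversal layer (GRL).
# Dimensions and module choices are illustrative.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the encoder
        return -ctx.lambd * grad_output, None

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU())  # e.g., on top of LM features
task_head = nn.Linear(256, 2)                            # main task (e.g., sentiment)
adv_head = nn.Linear(256, 2)                             # protected attribute (e.g., gender)
criterion = nn.CrossEntropyLoss()

def step(x, y_task, y_protected, lambd=1.0):
    h = encoder(x)
    task_loss = criterion(task_head(h), y_task)
    adv_loss = criterion(adv_head(GradReverse.apply(h, lambd)), y_protected)
    return task_loss + adv_loss  # backward() trains the adversary and "confuses" the encoder
```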

Constraining the output is a straightforward approach. Zhao et al. (2017) propose adding constraints that directly limit the ratio of males to females associated with specific activities in the model’s output. This method underscores the importance of gender balance, but it requires prior knowledge of gender ratios. Another idea involves rewriting given text to rectify implicit and potentially undesirable biases (Ma et al. 2020), which treats controllable debiasing as a new formalization of the stylistic rewriting task. However, both of these approaches have limitations: they either require prior information or depend on parallel corpora, which constrains further research in this area.

Introducing additional loss functions directly addresses, at the level of the loss function, the issue of models learning inappropriate associations (Zhao et al. 2018; Garg et al. 2019; Qian et al. 2019; Kaneko and Bollegala 2021). For instance, Zhao et al. (2018) propose a learning scheme that trains word embedding models with protected attributes (e.g., gender). This scheme confines protected-attribute information to specific dimensions while neutralizing the others during training; by restricting the protected attribute to certain dimensions, it can be easily removed from the embeddings. In a similar vein, Garg et al. (2019) introduce a metric called counterfactual token fairness to gauge counterfactual fairness in text classifiers and actively optimize it during the training phase. Another approach, presented by Qian et al. (2019), directly modifies the loss function in text generation, reducing gender bias during training by encouraging equal output probabilities for male and female words. Meanwhile, Kaneko and Bollegala (2021) focus on debiasing pre-trained contextualized embeddings at the token or sentence level. However, introducing additional loss functions relies on a rigid definition of bias; the requirements on the loss function are therefore stringent, making this approach somewhat inflexible.
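
As a minimal illustration of this family of methods, the sketch below adds an equalization-style auxiliary term, in the spirit of Qian et al. (2019), that penalizes the gap between the probability mass assigned to male and female words in a language model’s output distribution; the word-id sets and weighting factor are assumptions.

```python
# Equalization-style auxiliary loss sketch: penalize the log-ratio between the
# probability mass of male and female words in the LM output distribution.
import torch

def equalization_loss(logits, male_ids, female_ids):
    """logits: (batch, seq_len, vocab); *_ids: tensors of vocabulary indices."""
    probs = torch.softmax(logits, dim=-1)
    p_male = probs[..., male_ids].sum(dim=-1)      # total prob. mass of male words
    p_female = probs[..., female_ids].sum(dim=-1)  # total prob. mass of female words
    eps = 1e-8
    return (torch.log(p_male + eps) - torch.log(p_female + eps)).abs().mean()

# total_loss = lm_loss + 0.5 * equalization_loss(logits, male_ids, female_ids)
```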

3.3.3 Post-processing for models

Post-processing methods aim to remove bias from models after they have learned it. These methods can be broadly categorized into three types: projection-based, tuning-based, and probability-based methods.

Projection-based methods work by eliminating bias-related information in word embeddings. Schmidt (2015) introduced the first word embedding debiasing algorithm, which removed gender-related information. Bolukbasi et al. (2016) proposed an approach to ensure that bias-neutral terms keep consistent vector distances to bias-specific terms, introducing two debiasing methods: hard-debiasing and soft-debiasing. These methods first identify bias-related and bias-neutral terms and then reduce bias in the bias-related subspace. For example, in the case of gender bias, hard-debiasing ensures that gender-neutral words have zero projection onto the gender subspace, making any neutral word equidistant from all gender-related words. This approach is suitable for applications where no bias is desired, but it may impact specific applications in certain domains. The soft-debiasing algorithm, on the other hand, reduces differences among gender-neutral words in the gender subspace while preserving as much similarity to the original embedding as possible, with a parameter controlling this trade-off. Both hard-debiasing and soft-debiasing have been widely applied and further developed (Bordia and Bowman 2019; Park et al. 2018; Sahlgren and Olsson 2019; Bolukbasi et al. 2016; Karve et al. 2019; Sedoc and Ungar 2019). Subsequent methods have aimed to provide more accurate estimates of bias subspaces (Liang et al. 2020; Dev and Phillips 2019; Kaneko and Bollegala 2021; Ravfogel et al. 2020). Interestingly, Ethayarajh et al. (2019) point out that debiasing word embeddings via subspace projection is, under certain conditions, equivalent to training on an unbiased corpus. However, these methods heavily rely on predefined lists of gender-neutral words (Sedoc and Ungar 2019), and misidentifying gender-neutral words can impact downstream model performance (Zhao et al. 2018). There is also debate about whether such debiasing genuinely removes bias or merely masks it (Gonen and Goldberg 2019; Prost et al. 2019), and some work suggests that complete debiasing might be undesirable in domains such as social science and medicine (McFadden et al. 1992; Back et al. 2010). Some studies (Zhao et al. 2018; Bordia and Bowman 2019) indicate that bias serves a distinct purpose in specific situations. These insights can serve as a foundation for researchers to strategically utilize biased information within large models. An in-depth discussion is provided in Sect. 6.1.
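
The core of hard-debiasing can be sketched in a few lines: estimate a bias direction from definitional word pairs and remove its component from words assumed to be neutral. The following simplified version (with an assumed embedding dictionary `emb` and illustrative word lists) omits the pair-centering and re-equalization steps of the full algorithm.

```python
# Simplified hard-debiasing-style projection sketch: estimate a gender direction
# and zero out its component in (assumed) gender-neutral word vectors.
import numpy as np

def gender_direction(emb, pairs=(("he", "she"), ("man", "woman"))):
    diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
    # First principal component of the difference vectors approximates the bias subspace
    _, _, vT = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return vT[0]

def neutralize(vec, direction):
    direction = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, direction) * direction  # zero projection onto the subspace

# g = gender_direction(emb)
# emb["doctor"] = neutralize(emb["doctor"], g)  # "doctor" treated as gender-neutral
```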

Tuning-based methods aim to mitigate biases by employing various debiasing objectives and tuning approaches. These debiasing techniques encompass fine-tuning, prompt-tuning, and adapter-tuning, among others, as demonstrated by a range of studies (Kaneko and Bollegala 2021; Garimella et al. 2021; Lauscher et al. 2021; Zaheri et al. 2020; Askell et al. 2021; Yang et al. 2023; Li et al. 2023; Jin et al. 2021; Xie and Lukasiewicz 2023). Taking fine-tuning as an example, an upstream model first undergoes fine-tuning with a debiasing objective; subsequently, the upstream model, in conjunction with a new classification layer, is further fine-tuned for downstream tasks. It is worth noting that tuning-based methods rely on external corpora, and the effectiveness of debiasing may vary significantly depending on the specific external corpora used.

Probability-based methods utilize probabilistic adjustments to correct a model’s output in order to reduce bias or unfairness. This appears similar to the “constraining the output” strategy among in-processing methods, but the distinction is that probability-based methods require no training; the adjustments are made after the model has been trained. Schick et al. (2021) first investigate whether language models can detect undesirable attributes in their own outputs solely based on their internal knowledge, a process referred to as self-diagnosis. They then explore the potential of using this ability for self-debiasing, where language models autonomously discard undesirable behaviors in a fully unsupervised manner. To achieve this, Schick et al. (2021) propose a decoding algorithm that first prompts the generation of biased text using specific prompt words and then reduces the model’s probability of generating such biased text. Importantly, this method does not modify the language model and requires no additional training, although it cannot be applied to downstream tasks. Subsequently, Guo et al. (2022) extend this concept by automating the search for prompts that easily induce bias and use a distribution alignment loss to mitigate bias in language models. However, this improvement comes at the cost of additional training, which offsets the advantage of the previous method.
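
The sketch below gives a simplified flavor of such probability-based adjustment, loosely following the self-debiasing idea: next-token distributions are computed with and without a bias-inducing prefix, and tokens whose probability the prefix boosts are suppressed. The prefix wording and scaling constant are illustrative assumptions, not the exact algorithm of Schick et al. (2021).

```python
# Simplified self-debiasing-style decoding sketch: suppress tokens that a
# bias-inducing prefix makes disproportionately likely. Prefix and constants
# are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_probs(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)

def debiased_probs(context, decay=10.0):
    biased_prefix = "The following text contains offensive stereotypes:\n"
    p = next_token_probs(context)
    p_biased = next_token_probs(biased_prefix + context)
    # Down-weight tokens whose probability is boosted by the biased prefix
    delta = torch.clamp(p_biased - p, min=0.0)
    adjusted = p * torch.exp(-decay * delta)
    return adjusted / adjusted.sum()

# probs = debiased_probs("The new neighbor from abroad was")
# next_id = torch.argmax(probs).item(); print(tok.decode([next_id]))
```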

4 Dehallucinating in LLMs

In this section, we begin by providing a comprehensive definition of hallucinations observed in LLMs in Sect. 4.1. Next, we explore the underlying causes of hallucination in Sect. 4.2. We then detail the metrics and methods used for evaluating hallucination in LLMs in Sect. 4.3. We also provide an in-depth analysis of strategies to mitigate this issue in LLMs in Sect. 4.4.

4.1 Manifesting of hallucination

In general terms, hallucination is defined as a perception that appears real but is not based on reality. In the realm of large models, hallucination refers to content that, while appearing fluent and coherent, exhibits anomalies. This specifically means content produced by the model that deviates from its input, lacks empirical evidence, is devoid of meaningful coherence, or contradicts real-world facts.

To elucidate the intricate facets of hallucination, academic endeavors have sought to classify its various types. Zhang et al. (2023) present a meticulous taxonomy in a pivotal contribution. They segment these deviations into three distinct categories: input-conflicting hallucination, evident when LLM-generated content significantly strays from user input, often signaling misconstrued user intentions; context-conflicting hallucination, manifesting in extended interactions where LLMs lose contextual anchoring, potentially due to their inherent limitations in maintaining prolonged memory or discerning critical contexts; and fact-conflicting hallucination, which arises when LLMs produce content in stark contradiction to recognized facts. This categorization illuminates the intricate challenges inherent to LLMs.

Moreover, Sun et al. (2023) and Chen et al. (2021) bifurcate hallucinations into “intrinsic” and “extrinsic” classes. Intrinsic hallucination pertains to outputs conflicting directly with their input, such as summaries that diverge from an original document’s core essence. Conversely, extrinsic hallucination encapsulates content containing unsubstantiated details, often resembling contact particulars. Notably, these details, while unauthenticated, might be arbitrarily crafted by LLMs or derived from their training sets. Both studies solidify the foundational understanding of hallucination typologies and the challenges of their mitigation.

Additionally, multimodal hallucination research has gained traction. Rawte et al. (2023) classify large foundation models into text, images, videos, and audio categories, examining hallucination discrepancies across these modalities.

4.2 Causes of hallucination

This section elucidates the potential causes of hallucination in LLMs. Broadly, these can be categorized into two primary dimensions: the data level and the model level. Understanding these factors is imperative to discern why LLMs might exhibit hallucinations.

4.2.1 Data level

Data quality The training data for large models may include content that is either inaccurate or unfaithful. Utilizing such flawed data for training can embed erroneous beliefs within the model, subsequently leading to the generation of misleading information.

McKenna et al. (2023) explore the behavior of prominent LLMs, such as LLaMA (Touvron et al. 2023), GPT-3.5 (OpenAI 2022), and PaLM (Chowdhery et al. 2023), specifically in the context of NLI tasks. The authors discerned two primary culprits behind hallucinations in these models. First, these LLMs have a proclivity to memorize training data, leading them to falsely affirm NLI test samples based solely on the presence of a hypothesis in the training data, even if the premise does not support it. Second, LLMs were found to leverage a corpus-term-frequency heuristic, affirming hypotheses based largely on their frequency in the training data, even when this led to erroneous outcomes; this tendency became particularly pronounced in the absence of relevant memorized text. Expanding on the impact of training data quality, Filippova (2020) underscored the importance of data pre-processing. The author posited that hallucinations could be substantially curtailed by meticulously sieving factually inaccurate instances out of the training data, implying that the cleanliness of data plays an integral role in mitigating hallucinations. In a similar vein, Xu et al. (2023) further examined the internal mechanics of hallucinations in neural machine translation by analyzing token contributions. Their introspective study highlighted that the presence of erroneous instances in training data can drastically influence token-level contributions, culminating in hallucinated outputs. Collectively, these studies illuminate the profound influence of training data quality and composition on the propensity of LLMs to hallucinate, underlining the imperative of rigorous data preprocessing and scrutiny.

Information redundancy Excessive redundancy in training data can lead the model to disproportionately emphasize certain viewpoints or pieces of information, resulting in knowledge bias and increasing the tendency for hallucination.

In a quest to understand the effects of data quality on the efficacy of language models, Lee et al. (2021) investigated the impact of deduplicating training datasets. Their work elucidates the discernible benefits of training models on deduplicated datasets as opposed to their original, non-deduplicated counterparts. One of the primary findings from their study revealed that models trained on deduplicated data exhibited reduced instances of memorized text, leading to more diverse and coherent outputs. Furthermore, when subjected to a battery of downstream tasks, encompassing NLI, sentiment analysis, and summarization, these deduplicated models consistently achieved higher or at least comparable performance metrics relative to models trained on the original datasets. Notably, this enhanced performance was achieved with fewer training steps. Their findings underscore the idea that eliminating repetitive data points in training datasets is not merely a data preprocessing step, but rather a pivotal strategy to augment the performance and efficiency of language models. Such insights could be instrumental in the context of reducing hallucinations in large models, as repetitive information could arguably bias a model to produce redundant or overfit outputs.
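
For concreteness, the sketch below shows the simplest form of such preprocessing: exact deduplication after light normalization. Production pipelines such as that of Lee et al. (2021) additionally remove near-duplicates (e.g., with suffix arrays or MinHash), which this sketch does not attempt.

```python
# Minimal exact-deduplication sketch for a text corpus after light
# normalization; near-duplicate detection is intentionally omitted.
import hashlib

def dedup(documents):
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["The sky is blue.", "the sky   is blue.", "Water boils at 100 C."]
print(dedup(corpus))  # the normalized duplicate is removed
```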

4.2.2 Model level

Model architecture Weaker model architectures may lead to more severe hallucination problems in LLMs. The architecture and size of LLMs have emerged as potential factors influencing their susceptibility to hallucinations. Elaraby et al. (2023) ventured into the realm of weaker open-source LLMs, particularly focusing on BLOOM 7B (Workshop et al. 2022) as a representative model. The researchers posit that LLMs with reduced parameter counts, while being open-source, tend to manifest heightened rates of hallucinations compared to their more extensive counterparts. To tackle this, they introduced the HaloCheck framework, a tool designed to systematically quantify the severity of hallucinations experienced by these LLMs. Beyond diagnostic tools, Elaraby et al. (2023) also explored techniques such as knowledge injection and teacher-student paradigms to counteract hallucinations in these low-parameter LLMs. This research accentuates the importance of considering the trade-offs between model size and hallucination tendencies, especially as the NLP community gravitates towards more lightweight, open-source models for broader accessibility.

Decoding algorithms Studies indicate that employing sampling algorithms with greater uncertainty can predispose LLMs to produce hallucinations. Introducing randomness into the decoding process can sometimes result in the creation of imprecise or illogical text.

Decoding algorithms in LLMs have recently come under scrutiny for their potential influence on the generation of nonfactual or hallucinated information. Lee et al. (2022) thoroughly examined this issue, analyzing the factuality of text produced by LLMs and identifying inherent pitfalls in prevailing decoding algorithms, notably nucleus sampling. They highlighted how this algorithm, during open-ended text generation, introduces “uniform randomness” at every decoding step; this randomness, they argue, can culminate in the erroneous merging of disparate named entities or even in the outright invention of data, ultimately compromising the factual integrity of the resultant text. Recognizing the gravity of this challenge, the authors proposed the “factual-nucleus sampling” algorithm. Tailored to bolster the factuality of generated content, this algorithm simultaneously preserves text quality and diversity, thereby addressing the pitfalls associated with conventional nucleus sampling’s indiscriminate randomness. Lee et al. (2022) underscore the pivotal role of decoding algorithms in shaping the accuracy and reliability of LLM outputs, spotlighting the pressing need to refine these techniques to curb hallucinations.
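
The sketch below illustrates the underlying idea with plain top-p (nucleus) sampling plus a per-step decaying p, loosely following the factual-nucleus intuition that randomness should shrink as a sentence proceeds; the decay rate, lower bound, and reset rule are illustrative assumptions rather than the exact algorithm of Lee et al. (2022).

```python
# Top-p (nucleus) sampling with a per-step decaying p; constants are illustrative.
import torch

def top_p_sample(probs, p):
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < p          # keep tokens inside the nucleus
    keep[0] = True                                # always keep the top token
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()
    choice = torch.multinomial(filtered, 1)
    return sorted_idx[choice].item()

def decayed_p(step_in_sentence, p=0.9, lam=0.9, omega=0.3):
    # p_t = max(omega, p * lam^(t-1)); reset step_in_sentence at sentence boundaries
    return max(omega, p * lam ** (step_in_sentence - 1))

# Inside a generation loop:
# probs = next_token_probs(context)          # e.g., softmax over LM logits
# token_id = top_p_sample(probs, decayed_p(t))
```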

Exposure bias A significant factor leading LLMs to hallucinate is exposure bias, which arises from the disparity between the model’s training and generation phases. When a model is trained on a static dataset but tasked with generating text based on its own prior outputs, especially during extended responses, errors can compound: the model’s mistakes are not rectified or penalized, compromising the quality of its output. Moreover, exposure bias affects the model’s proficiency in processing seldom-seen or novel words, phrases, or scenarios. Instead of interpreting the input’s semantics or logic, the model might over-rely on the statistical patterns of its training data.

Exposure bias has emerged as a pivotal concern in Neural Machine Translation (NMT), especially due to its hypothesized connection with hallucinations, particularly during domain shifts. A seminal investigation by Wang and Sennrich (2020) meticulously unravels this intricate relationship. Wang and Sennrich (2020) establish that exposure bias can fuel the NMT system’s proclivity to generate hallucinations, which manifests as translations bearing minimal relevance to the original input, especially under conditions of domain shift. This finding was empirically corroborated through rigorous experiments on three distinct datasets spanning multiple test domains, which conclusively showed that hallucinations are, at least in part, a consequence of exposure bias, particularly pronounced during domain shifts. Venturing beyond mere diagnostic insights, the authors propose a mitigation technique predicated on Minimum Risk Training. This strategy, by eschewing exposure bias, demonstrated a marked decline in hallucination instances during domain shifts. The revelations from Wang and Sennrich’s work spotlight the critical influence of exposure bias on the fidelity of NMT outputs, while simultaneously charting a pathway towards potential remediation through innovative training methodologies.

4.3 Hallucination evaluation

While previous surveys on hallucination in LLMs treated hallucination evaluation and hallucination detection as two distinct aspects (Zhang et al. 2023), hallucination evaluation can now be regarded as a broader concept that encompasses hallucination detection. It involves identifying fictitious or false information in generated text while also assessing the overall quality and logical coherence of the generated content. In hallucination evaluation, the focus extends beyond merely determining the presence of fictitious information to encompass general text quality, context coherence, and the relevance of information, constituting a more comprehensive process for assessing text quality.

Similar to bias evaluation, before general-purpose generative LLMs became mainstream, research primarily focused on hallucination evaluation for specific downstream tasks such as text summarization (Kryściński et al. 2019; Maynez et al. 2020; Nan et al. 2021; Scialom et al. 2021), generative QA (Durmus et al. 2020), translation (Zhou et al. 2020; Guerreiro et al. 2023), and data-to-text generation (Wang et al. 2020; Dhingra et al. 2019). Evaluating these tasks typically only requires assessing the faithfulness of the generated content, ensuring that the target text does not conflict with the input. However, with the ubiquity of general-purpose LLMs and their ability to quickly adapt to various downstream tasks through prompts (Brown et al. 2020), there is growing concern in the community regarding the trustworthiness and utility of model-generated content. Motivated by this, more and more research has started to focus on evaluating the factuality of generated content (Lin et al. 2022; Lee et al. 2022).

The following sections begin with a review of recent evaluation metrics, with particular attention to widely adopted, non-task-specific LLM generation, followed by taxonomies of evaluation methods and existing benchmarks.

4.3.1 Evaluation metrics and methods

Previous works on specific tasks usually adopt traditional metrics such as BLEU (Papineni et al. 2002), ROUGE (Lin 2004), and METEOR (Banerjee and Lavie 2005) to measure the quality of generated content. However, these metrics, which rely on n-gram overlap to quantify the similarity between generated text and reference text, face challenges in evaluating the level of hallucination (Dhingra et al. 2019; Durmus et al. 2020). Therefore, researchers have shifted towards model-based evaluation. Due to the flexibility of model-based evaluation methods, their corresponding metrics are not directly comparable, as they are computed from different aspects of the output. We therefore present a taxonomy of existing evaluation methods along with the metrics used by each. Explanations of each evaluation method are given in Fig. 6, and each class is examined in the subsequent discussions.

Fig. 6

Examples of three classes of hallucination evaluation methods

Human evaluation Evaluating hallucination in current LLMs poses significant challenges because of their capability to generate diverse and contextually relevant text, which makes it difficult to distinguish facts from misinformation. Therefore, the most commonly used and reliable evaluation method involves human experts following specific guidelines (Santhanam et al. 2021; Shuster et al. 2021; Wu et al. 2021; Lin et al. 2022; Lee et al. 2022; Min et al. 2023; Li et al. 2023). Santhanam et al. (2021) and Shuster et al. (2021) employ human annotation to perform binary classification of whether models exhibit hallucinations. They also use a simple hallucination rate, the percentage of generated answers that exhibit hallucinations, to assess the degree of hallucination in the models. Lin et al. (2022) design an evaluation procedure in which evaluators assign one of 13 labels to an answer; each label is mapped to a truth score, and the truthfulness score for a question is the total normalized likelihood of the true answers. Liu and Wan (2023) introduce a more elaborate approach in which evaluators carry out a three-level (paragraph-level, sentence-level, and word-level) factuality annotation for each generated output. FActScore (Min et al. 2023) breaks a generation into atomic facts, short statements containing one piece of information each; after assigning each atomic fact a binary label, it computes factual precision to quantify hallucination. While human evaluation is considered the most accurate criterion, with the highest credibility and interpretability, it is labor-intensive and lacks reproducibility due to subjectivity across evaluators (Belz et al. 2022, 2023).
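
Aggregating such human judgments into scores is straightforward; the sketch below computes a simple hallucination rate and a FActScore-style factual precision over hypothetical annotation data.

```python
# Aggregation sketch for human judgments: hallucination rate and a
# FActScore-style factual precision. Data structures are illustrative.
def hallucination_rate(answers_have_hallucination):
    """answers_have_hallucination: list of bools, one per generated answer."""
    return sum(answers_have_hallucination) / len(answers_have_hallucination)

def factscore(generations):
    """generations: list of lists of (atomic_fact, is_supported) pairs."""
    per_generation = [sum(ok for _, ok in facts) / len(facts) for facts in generations]
    return sum(per_generation) / len(per_generation)

# Example: two generations, the second contains one unsupported atomic fact.
gens = [[("Paris is in France", True)],
        [("The paper was published in 2019", True), ("It won the Turing Award", False)]]
print(hallucination_rate([False, True]), factscore(gens))  # -> 0.5 0.75
```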

Statistical-based evaluation Conventional metrics like ROUGE and BLEU, which calculate the overlap between generated text and target text, remain widely used. Several studies compare the correlation between automatic metrics and human evaluation (Dhingra et al. 2019; Durmus et al. 2020; Lin et al. 2022; Lee et al. 2022; Liu and Wan 2023) and find that conventional metrics correlate poorly with human judgement when evaluating hallucination in generated content, suggesting that these metrics may not be suitable in this setting. Another n-gram-based metric is PARENT (Dhingra et al. 2019), which computes overlap against the input data in addition to the reference text, since the reference may not always contain complete information to support the generated text; this metric aligns more closely with human judgment. Shuster et al. (2021) employ Knowledge F1, a variant of unigram F1, to measure the overlap between the model's generation and the knowledge on which the ground-truth human response was grounded during dataset collection. They also propose Rare F1, which considers only infrequent words in the dataset when calculating F1. Yu et al. (2023) develop a self-contrast metric to assess a model's ability to generate factual content by contrasting two completions of the same context: one generated without foreknowledge and one with it. This metric also uses human-written succeeding text to prevent evaluation collapse and employs the ROUGE-L (F1) score.
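As a concrete illustration of the unigram-overlap family behind Knowledge F1 and Rare F1, the sketch below computes a plain unigram F1 with naive whitespace tokenization; it is a simplified stand-in, not the metric implementations used in the cited works.

```python
# A minimal sketch of unigram F1, the basis of Knowledge F1 discussed above:
# overlap is computed between the generated tokens and a reference set (the
# grounding knowledge for Knowledge F1, or only its infrequent words for
# Rare F1). Tokenization here is a naive whitespace split for illustration.
from collections import Counter

def unigram_f1(generated: str, reference: str) -> float:
    gen, ref = Counter(generated.lower().split()), Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the eiffel tower is in paris",
                 "the eiffel tower is located in paris , france"))  # 0.8
```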

Model-based evaluation Model-based evaluation refers to methods that use additional neural models to assist in evaluating the hallucination of the model under evaluation. Methods that merely alter model inputs to obtain different generations (Yu et al. 2023) do not fall under this category. A simple and representative approach is to train a model to classify generations based on additional information (Lin et al. 2022; Cheng et al. 2023). Lee et al. (2022) combine a named-entity-based metric and a textual-entailment-based metric to capture different aspects of factuality: the former detects factual errors related to named entities using a named-entity detection model, while the latter assesses whether the ground-truth knowledge entails the model's generations using an NLI model. The idea of using NLI to assess hallucination has been adopted by many studies (Falke et al. 2019; Kryściński et al. 2019; Honovich et al. 2021; Mishra et al. 2021; Lee et al. 2022; Laban et al. 2022). However, the "neutral" label in NLI often fails to explicitly indicate hallucination in generated content; nevertheless, many studies still interpret the neutral label as indicative of hallucination from the perspective of faithfulness. Besides this transfer-style approach, another research line incorporates additional information into the inputs before querying commercial LLMs. FActScore (Min et al. 2023) leverages a retrieval model to gather passages from the given knowledge source and then feeds the knowledge-augmented input to LLMs, such as ChatGPT (OpenAI 2022), prompting them to judge whether a statement is true. Self-Checker (Li et al. 2023) decomposes the fact-checking process into modular steps: claim processing, query generation, evidence retrieval, and verdict prediction, thereby introducing additional information for the model to check itself. Model-based evaluation methods have now become the primary proxy for human evaluation. However, the auxiliary neural models can themselves make errors that propagate and adversely affect the accurate quantification of hallucination (Ji et al. 2023), so this kind of evaluation method still offers significant research potential.
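The snippet below is a minimal sketch of the NLI-based checking idea, assuming a publicly available MNLI model (roberta-large-mnli is used here purely for illustration): the grounding knowledge serves as the premise, the model's generation as the hypothesis, and any label other than entailment is treated as unsupported.

```python
# A minimal sketch of NLI-based hallucination checking as described above.
# The choice of roberta-large-mnli is an illustrative assumption; the label
# mapping is read from the model config rather than hard-coded.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def is_supported(knowledge: str, generation: str) -> bool:
    # Premise = grounding knowledge, hypothesis = model generation.
    inputs = tokenizer(knowledge, generation, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[int(logits.argmax(dim=-1))]
    return label.lower() == "entailment"  # neutral/contradiction -> unsupported

print(is_supported("Marie Curie won two Nobel Prizes.",
                   "Marie Curie was a Nobel laureate."))
```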

4.3.2 Evaluation benchmarks

This section presents widely used and newly proposed benchmarks for evaluating hallucination in LLMs, which are shown in Table 3. Unlike benchmarks for evaluating bias, benchmarks for evaluating hallucination are often tied to specific tasks. Among them, benchmarks that use generation as the evaluation task (Min et al. 2023; Yu et al. 2023; Lee et al. 2022) assess the models' ability to generate factual text. Among benchmarks that use QA as the evaluation task, besides methods that consider the coherence and fluency of generation, as in generation tasks (Lin et al. 2021), there are also methods that present models with choices to select from in a generative manner (Li et al. 2023; Lin et al. 2021; Rashkin et al. 2023; Elaraby et al. 2023). Existing benchmarks also vary significantly in their evaluation methods: some rely on human evaluation of the generated content, while others employ automatic evaluation. For instance, FACTOR (Muhlgay et al. 2023) assesses a model's ability to assign a higher likelihood to the original factual completion than to any of the false variations, which is primarily a language-modeling task; its approach of introducing variations through InstructGPT for the model to make multiple-choice selections is model-based. Most benchmarks build their datasets from prior datasets or Wikipedia, and some employ advanced LLMs, such as ChatGPT and InstructGPT, for data generation (Li et al. 2023; Muhlgay et al. 2023). Instead of presenting a new dataset, AIS (Rashkin et al. 2023) puts forth a set of evaluation standards.

Table 3 Widely used benchmarks for evaluating hallucination in LLMs

4.4 Dehallucinating methods

This section delineates the strategies employed to mitigate hallucinations in LLMs. Similar to debiasing methods, a taxonomy of dehallucinating methods is given. We classify them into pre-processing for data, in-processing during training, intra-processing without training, and post-processing during inference, which are illustrated in Fig. 7. Within these four categories, we further categorize various dehallucinating methods, with their corresponding research shown in Table 4. In the following sections, we will introduce and discuss these methods outlined in the table.

Fig. 7
figure 7

Illustration of our taxonomy of mitigation stages for dehallucinating

Table 4 Dehallucinating methods

4.4.1 Pre-processing for data

The core focus of the data pre-processing phase is improving data quality and constructing high-quality datasets. Penedo et al. (2023) crawl text data from the Web, applying content filtering and deduplication to develop RefinedWeb, a high-quality and diverse dataset. Using this dataset, they train two models of different sizes, Falcon-7B and Falcon-40B, and benchmark them against other open-source models on many NLP tasks such as QA, text summarization, and dialogue. The results demonstrate that both Falcon-7B and Falcon-40B (Penedo et al. 2023) can match or surpass the performance of other models. Li et al. (2023) leverage existing textbook data to generate high-quality datasets for model training; the models trained on these datasets perform on par with models that have five times more parameters on some NLP tasks, highlighting the value of quality over quantity in training data. In a similar vein, Touvron et al. (2023) meticulously select and curate public source data to build a high-quality database; the Llama 2 model, trained on this database, exhibits commendable performance, underscoring the value of curated data sources. Furthermore, Lee et al. (2022) design a test set called FactualityPrompts to measure the factuality of texts generated by pre-trained language models, providing a means to systematically assess and improve the reliability of generated content.
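The sketch below illustrates, in a much-simplified form, the filtering-and-deduplication idea underlying datasets such as RefinedWeb; the heuristics and thresholds are illustrative assumptions, and production pipelines use far richer filters and fuzzy deduplication.

```python
# A much-simplified sketch of data-quality pre-processing: heuristic content
# filtering followed by exact (whitespace-normalized) deduplication via hashing.
# Real pipelines use far more elaborate filters and fuzzy (e.g., MinHash) dedup;
# the thresholds below are illustrative assumptions.
import hashlib
import re

def passes_filter(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    words = doc.split()
    if len(words) < min_words:                      # drop very short documents
        return False
    symbols = len(re.findall(r"[#{}<>|\\]", doc))   # crude markup/boilerplate signal
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def deduplicate(docs):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:                      # keep only the first copy
            seen.add(digest)
            kept.append(doc)
    return kept

raw_docs = ["example web document " * 30, "example web document " * 30]  # placeholder corpus
corpus = [d for d in deduplicate(raw_docs) if passes_filter(d)]
print(len(corpus))  # 1 after deduplication
```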

A further notable result in this line of work is the reduction of hallucinations achieved by carefully fine-tuning models such as MiniGPT-4 (Zhu et al. 2023) and mPLUG-Owl (Ye et al. 2023) on the LRV-Instruction dataset (Liu et al. 2023). This fine-tuning not only reduces the incidence of hallucinations but also improves the models' performance across various benchmark datasets, notably with less training data. These results highlight the effectiveness of well-balanced dataset compositions in developing more robust models, reinforcing the principle that dataset quality is crucial for dehallucinating LLMs.

4.4.2 In-processing during training

During the model training phase, methods can generally be categorized into two main approaches: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

SFT SFT adapts pre-trained language models to specific downstream tasks by tuning the model's parameters on labeled data. Zhou et al. (2023) introduce LIMA, which fine-tunes the model on 1k meticulously selected prompts and responses and yields impressive outcomes; in one test, LIMA matches the performance of GPT-4 and even surpasses it in 43% of cases. Chen et al. (2023) employ a powerful language model, such as ChatGPT, to automatically filter out low-quality data and fine-tune a new model, AlpaGasus, on just 9k high-quality data points selected from the 52k Alpaca data. AlpaGasus demonstrates significant improvement over the original Alpaca-based model and matches or exceeds GPT-4's performance on multiple test sets. Cao et al. (2023) propose InstructMining, a method capable of autonomously selecting high-quality instruction-following data for fine-tuning LLMs. Elaraby et al. (2023) develop HaloCheck, a lightweight, black-box, knowledge-free framework for quantifying hallucination severity in LLMs; HaloCheck can estimate hallucination intensity and provide a score without accessing the internal structure or operational principles of the LLM, thus aiding in the mitigation of hallucinations. Shi et al. (2023) explore the efficacy of incorporating general instruction tuning when building specialized models. Their evaluation across four target tasks with varying coverage levels shows that when task coverage is broad, integrating general instruction tuning further enhances model performance, providing systematic guidance for developing specialized models. Sun et al. (2023) introduce SynthData, a synthetic data generation method based on GPT-2 that produces a substantial volume of synthetic data from user inputs and dialogue history for fine-tuning dialogue language models (DLMs); this method effectively improves the generalization ability and robustness of DLMs, outperforming existing methods on certain tasks. Jones et al. (2023) propose SynTra, which successfully reduces the degree of hallucination of two 13B-parameter LLMs when used for fine-tuning.
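The following sketch outlines the common thread of these works, supervised fine-tuning on a small set of curated instruction-response pairs; a small GPT-2 model stands in for an LLM, and practical details such as masking prompt tokens out of the loss are omitted.

```python
# A minimal sketch of supervised fine-tuning on curated instruction data.
# GPT-2 is an illustrative stand-in for an LLM; the tiny dataset and training
# loop are hypothetical and omit loss masking, batching, and evaluation.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

curated_pairs = [  # hypothetical high-quality instruction-response pairs
    ("Instruction: Name the capital of France.\nResponse:", " Paris."),
]

model.train()
for epoch in range(3):
    for prompt, response in curated_pairs:
        batch = tokenizer(prompt + response, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])  # next-token loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```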

RLHF RLHF integrates human guidance with reinforcement learning, with the aim of enabling AI systems to learn from human preferences and thereby better handle complex and uncertain tasks. Ouyang et al. (2022) propose using human feedback to fine-tune language models so that they are more aligned with user intentions. Lightman et al. (2023) compare process supervision and outcome supervision in training and find that process supervision significantly outperforms outcome supervision; they also release the PRM800K dataset, containing 800,000 step-level human feedback labels. Wu et al. (2023) introduce fine-grained RLHF, a framework that trains on nuanced reward functions in two respects: (1) density, giving rewards after each generated text segment, and (2) multiple reward models corresponding to different feedback types (e.g., factual inaccuracy, irrelevance, and incompleteness). Sun et al. (2023) propose factually augmented RLHF, which enhances the human-feedback reward model with additional factual information, such as image descriptions and ground-truth multiple-choice options, to reduce reward gaming in RLHF and improve performance. Li et al. (2023) develop Themis, a tool-augmented preference modeling method that promotes synergy between tool use and reward scoring and enhances the explanatory power and reliability of scoring. Despite its potential, RLHF does not always work effectively. To test its impact on the GPT-4 base model, various tests were conducted on both the base model and the post-RLHF GPT-4 model; the average score across all tests is 73.7% for the base model and 74.0% for the RLHF model, indicating no substantial change in the base model's capabilities after RLHF training. Hosking et al. (2023) explore the limitations of human feedback in evaluating LLM performance and its use as a training target, arguing that human feedback is subjective and unreliable due to personal biases and annotation errors. Using an instruction-tuned model to generate text with varying degrees of confidence and complexity, they find that confidence affects the perception of factual errors, suggesting that human feedback cannot fully capture truthfulness.
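To illustrate the fine-grained reward idea, the sketch below combines per-segment scores from several hypothetical reward models into one dense reward signal; the reward functions and weights are placeholders, not those of the cited work.

```python
# An illustrative sketch of the "fine-grained" reward idea described above:
# each generated segment receives rewards from several reward models tied to
# different feedback types, combined into one dense reward per segment.
# The reward functions and weights are hypothetical placeholders.
from typing import Callable, List

def combined_segment_rewards(
    segments: List[str],
    reward_models: List[Callable[[str], float]],  # e.g., factuality, relevance
    weights: List[float],
) -> List[float]:
    return [
        sum(w * rm(seg) for rm, w in zip(reward_models, weights))
        for seg in segments  # one scalar reward per segment, used by the RL update
    ]

# Example with stub reward models.
factuality = lambda s: 1.0 if "Paris" in s else -1.0
relevance = lambda s: 0.5
print(combined_segment_rewards(
    ["The capital of France is Paris.", "Bananas are blue."],
    [factuality, relevance], [1.0, 0.2]))  # [1.1, -0.9]
```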

4.4.3 Intra-processing without training

The intra-processing methods, which require no training phase, can be categorized into four key aspects: designing decoding strategies, resorting to external knowledge, pre-detecting and preventing, and multi-agent interaction.

Designing decoding strategies Designing effective decoding strategies can significantly improve the performance of language models. Lee et al. (2022) propose a sampling algorithm called factual-nucleus sampling, which dynamically adjusts randomness to improve the factuality and quality of generated text. Mallen et al. (2023) propose a retrieval-augmentation method that retrieves non-parametric memory only when necessary, maintaining the performance of LLMs while reducing inference cost and helping LLMs better handle questions that require rich world knowledge. Context-aware decoding (Shi et al. 2023) specifically addresses the incorporation of contextual information in content generation, resolving knowledge conflicts effectively. Inference-time intervention (ITI) (Li et al. 2023) influences the model's generated content by adjusting activation vectors during inference. DoLa (Chuang et al. 2023) leverages factual knowledge obtained by contrasting the LLM's transformer layers to improve the accuracy of word prediction, demonstrating its ability to reduce the generation of false facts. Contrasting with these intricate methods, Sennrich et al. (2023) offer a simpler approach that modifies the decoding objective to mitigate hallucinations and off-target translations. Chain-of-verification (Dhuliawala et al. 2023) effectively reduces hallucination rates and enhances the accuracy and credibility of responses. Knowledge-constrained tree search (Choi et al. 2023) guides models to generate text consistent with reference knowledge at each decoding step. Chen et al. (2023) propose a decoding method called fidelity-enriched contrastive search, which improves semantic similarity to the provided source while maintaining the diversity of the generated text, thereby reducing hallucination. Mitchell et al. (2023) propose emulated fine-tuning, which combines the knowledge that large models learn during pre-training with the knowledge learned during fine-tuning by smaller models.
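As an illustration of how such decoding strategies intervene, the sketch below shows a decaying top-p schedule in the spirit of factual-nucleus sampling, where randomness shrinks as a sentence progresses and resets at sentence boundaries; the hyperparameter values are assumptions, and the full algorithm in Lee et al. (2022) contains further details.

```python
# A minimal sketch of a decaying top-p schedule in the spirit of
# factual-nucleus sampling: the nucleus mass p decays per token within a
# sentence (down to a floor) and would be reset at each new sentence.
# Hyperparameter values are illustrative assumptions.
def nucleus_p(token_index_in_sentence: int,
              p: float = 0.9, decay: float = 0.9, floor: float = 0.3) -> float:
    # Plug the returned value into any standard top-p (nucleus) sampling routine.
    return max(floor, p * decay ** token_index_in_sentence)

print([round(nucleus_p(t), 3) for t in range(8)])
# [0.9, 0.81, 0.729, 0.656, 0.59, 0.531, 0.478, 0.43]
```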

Resorting to external knowledge Incorporating external knowledge has become a key strategy for enhancing the content generation capabilities of LLMs. LLM-Augmenter (Peng et al. 2023) is a system designed to enable LLMs to generate more useful and accurate answers by tapping into external knowledge sources, such as task-specific databases, significantly enhancing the utility and precision of LLM responses. Jin et al. (2023) propose GeneGPT, which teaches LLMs to use the Web APIs of the National Center for Biotechnology Information to answer genomics questions. The CRITIC framework, proposed by Gou et al. (2023), allows LLMs to review and refine their own outputs through human-like interaction with external tools, facilitating a more dynamic and self-improving generation process. Luo et al. (2023) introduce parametric knowledge guiding, a framework that offers a knowledge guidance module for LLMs, enabling access to relevant knowledge at runtime without altering the LLMs' parameters and thus boosting their performance. Feng et al. (2023) propose the knowledge solver method, which teaches LLMs to search for domain knowledge in knowledge graphs, helping the model better understand the context and improve accuracy when performing tasks. Qian et al. (2023) propose a systematic framework that reveals different knowledge structures of LLMs by constructing parameterized knowledge graphs and introducing external knowledge through perturbations of varying degrees, methods, positions, and formats. Binary token representations (Cao et al. 2023) aim to enhance the efficiency and performance of retrieval-augmented language models, advancing the integration of retrieval mechanisms with language models. Vu et al. (2023) propose a simple few-shot prompting method called FreshPrompt, which improves the performance of LLMs on the dynamic question answering benchmark FreshQA by retrieving relevant and up-to-date information from search engines.
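The sketch below captures the basic retrieval-augmented pattern shared by many of these systems: retrieve passages relevant to the question and prepend them to the prompt; the retriever and the LLM call are hypothetical placeholders, and systems such as LLM-Augmenter add feedback loops on top of this pattern.

```python
# A generic sketch of "resorting to external knowledge": retrieve passages for
# the question and prepend them to the prompt before querying the LLM.
# `retrieve` and `generate` are hypothetical placeholders for a retriever
# (e.g., BM25 or a dense index) and any LLM API wrapper.
from typing import Callable, List

def knowledge_augmented_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    passages = retrieve(question, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the evidence below; "
        "say 'I don't know' if the evidence is insufficient.\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```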

Pre-detecting and preventing Pre-detecting and preventing aims to predict, during the generation process, content that may cause hallucinations and to prevent it. Varshney et al. (2023) use the model's logit output values to identify candidates for potential hallucination, check their correctness through a validation procedure, mitigate the detected hallucinations, and then let the model continue the generation process. Luo et al. (2023) propose a pre-detection self-assessment technique called SELF-FAMILIARITY, which assesses the familiarity of concepts in input instructions and refuses to generate responses when encountering unfamiliar concepts. Yuksekgonul et al. (2023) propose the SAT Probe method, which predicts the degree of constraint satisfaction and factual errors. Li et al. (2023) propose ITI, which changes the model's activation states during inference and steers generation in a specific direction. Ishibashi and Shimodaira (2023) fine-tune the model so that it generates harmless answers when responding to questions that may involve sensitive information.
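As a minimal sketch of logit-based pre-detection in the spirit of Varshney et al. (2023), the snippet below flags generated tokens whose model probability falls below a threshold as candidates for verification; the threshold and the GPT-2 stand-in are illustrative assumptions.

```python
# A minimal sketch of logit-based pre-detection: tokens with low model
# probability are flagged as candidates for potential hallucination and would
# be handed to a downstream verification step. GPT-2 and the threshold are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def low_confidence_tokens(text: str, threshold: float = 0.05):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    probs = torch.softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n-1
    token_probs = probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return [
        (tokenizer.decode(int(ids[0, i + 1])), float(p))
        for i, p in enumerate(token_probs) if p < threshold  # flag low-probability tokens
    ]

print(low_confidence_tokens("The Eiffel Tower is located in Berlin."))
```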

Multi-agent interaction Multi-agent interaction allows multiple agents to interact in order to improve the quality of answers. Du et al. (2023) introduce a strategy that incorporates elements of social awareness and multi-agent dynamics: multiple language model instances (agents) individually answer or debate a given question, eventually reaching a common best answer. Cohen et al. (2023) design a multi-round interaction framework in which one language model serves as an examiner and asks questions of another language model in order to identify contradictions. Wang et al. (2023) propose Solo Performance Prompting, which turns an LLM into a cognitive collaborator by engaging in multiple rounds of interaction with various personas to tackle complex tasks. Li et al. (2023) use LLMs as agents for multi-agent collaboration and evaluate their performance on Theory of Mind inference tasks.
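A schematic sketch of the debate pattern is given below: several agent instances answer independently and then revise their answers after seeing the others' responses; query_llm is a hypothetical wrapper around any chat LLM.

```python
# A schematic sketch of multi-agent debate as described above. `query_llm` is
# a hypothetical wrapper around any chat LLM; prompts and round counts are
# illustrative assumptions.
from typing import Callable, List

def debate(question: str, query_llm: Callable[[str], str],
           n_agents: int = 3, n_rounds: int = 2) -> List[str]:
    answers = [query_llm(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds):
        new_answers = []
        for i in range(n_agents):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\nOther agents answered:\n{others}\n"
                f"Your previous answer: {answers[i]}\n"
                "Considering the other answers, give your revised final answer."
            )
            new_answers.append(query_llm(prompt))
        answers = new_answers
    return answers  # ideally converging toward a common, better-grounded answer
```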

4.4.4 Post-processing during inference

The methods in the post-processing stage during inference can be divided into three categories: detecting and revising, human-in-the-loop, and analyzing internal model states.

Detecting and revising Detecting and revising refers to identifying and modifying the hallucinated parts of the content after the model has generated it. Gao et al. (2023) propose RARR, a system that automatically finds and edits the output of LLMs in the later stages of text generation to improve its credibility and accuracy. Huang et al. (2023) propose a zero-shot method for correcting factual errors in input statements. Chen et al. (2023) propose PURR, a fully unsupervised method for editing erroneous or unreasonable information generated by language models. Zhao et al. (2023) propose Verify-and-Edit, a framework for chain-of-thought prompting whose goal is to improve the factuality of generated content through post-editing based on external knowledge. Li et al. (2023) propose the Self-Checker framework, composed of a set of pluggable modules that implement fact checking simply by asking questions of LLMs, without fine-tuning the model. Chern et al. (2023) propose FacTool, a task- and domain-agnostic framework for detecting factual errors in texts generated by LLMs (e.g., ChatGPT). FLEEK, developed by Bayat et al. (2023), automatically extracts factual statements from text, collects evidence from external knowledge sources, evaluates the factuality of each statement, and uses the collected evidence to recommend corrections to incorrect statements. Manakul et al. (2023) introduce SelfCheckGPT, a method for black-box hallucination detection in generative LLMs (such as GPT-3). Mündler et al. (2023) propose a prompting-based framework that effectively detects and eliminates self-contradictory content; the framework is suitable for black-box language models and requires no external knowledge. Agrawal et al. (2023) propose a simple search-engine query method that effectively identifies fabricated citations and can be used to evaluate the performance of LLMs. Zhao et al. (2023) propose a method based on Pareto-optimal self-supervision, which leverages existing programmatic supervision to systematically assess the risk of LLM answers, calibrating a risk score for each response without additional manual intervention. Yang et al. (2023) propose an uncertainty-aware in-context learning framework that allows the model to adjust or withhold its output based on uncertainty.
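The sketch below illustrates, in a deliberately simplified form, the sampling-based consistency idea behind methods such as SelfCheckGPT: sentences of the main response that are poorly supported by re-sampled responses receive higher hallucination scores; the unigram-overlap scorer is only an illustrative stand-in for the stronger consistency measures (e.g., NLI or QA) used in the original work.

```python
# A much-simplified sketch of sampling-based consistency checking: sentences of
# the main response that are poorly supported by stochastically re-sampled
# responses are scored as more likely hallucinated. The naive unigram-overlap
# scorer below is an illustrative stand-in, not the SelfCheckGPT metric.
from typing import List

def support(sentence: str, sample: str) -> float:
    sent, samp = set(sentence.lower().split()), set(sample.lower().split())
    return len(sent & samp) / max(len(sent), 1)

def hallucination_scores(main_sentences: List[str], samples: List[str]) -> List[float]:
    # Higher score = less consistent with the samples = more likely hallucinated.
    return [1.0 - max(support(s, smp) for smp in samples) for s in main_sentences]

samples = ["Ada Lovelace worked with Charles Babbage on the Analytical Engine.",
           "Ada Lovelace is regarded as an early pioneer of computing."]
print(hallucination_scores(
    ["Ada Lovelace worked with Charles Babbage.", "She won the 1903 Nobel Prize."],
    samples))  # the fabricated second sentence receives a much higher score
```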

Human-in-the-loop Model performance can be improved through interaction between people and the model. Zhang et al. (2023) propose MixAlign, a framework that interacts with users and knowledge bases to obtain and integrate the relationship between user questions and stored information while LLMs generate answers; experimental results show that MixAlign significantly improves model performance and reduces hallucinations compared to existing methods. Dou et al. (2023) examine the role of human-AI interaction in text generation.

Analyzing internal model states Analyzing a model's internal states can improve its transparency, facilitate human understanding, and lay the foundation for mitigating hallucinations. Azaria and Mitchell (2023) train a classifier that outputs the probability of a statement being true based on the LLM's hidden-layer activations when reading or generating the statement, thus using the internal state of the LLM to determine the statement's truthfulness. Zou et al. (2023) introduce the Representation Engineering method, which draws on cognitive neuroscience and could make AI systems more transparent.
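The sketch below shows the probing idea in miniature: extract a hidden-layer activation for each labeled statement and fit a simple classifier on top; GPT-2, the chosen layer, and the toy labels are illustrative assumptions, as the cited works use larger models and datasets.

```python
# A minimal sketch of probing hidden states for truthfulness as described
# above. GPT-2, the chosen layer, and the toy labeled statements are
# illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoder = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def statement_embedding(text: str, layer: int = 6):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**ids).hidden_states[layer]  # (1, seq_len, dim)
    return hidden[0, -1].numpy()                      # last-token activation

statements = ["Paris is the capital of France.", "The sun orbits the earth.",
              "Water boils at 100 degrees Celsius at sea level.", "Spiders are mammals."]
labels = [1, 0, 1, 0]                                 # 1 = true, 0 = false (toy data)

probe = LogisticRegression(max_iter=1000).fit(
    [statement_embedding(s) for s in statements], labels)
print(probe.predict_proba([statement_embedding("Cats are mammals.")])[0, 1])
```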

5 Comparative analysis of bias and hallucination in LLMs

In this section, we analyze and compare bias and hallucination in LLMs from various perspectives. We explore both the similarities and differences between these two problems.

5.1 Contributors

Bias primarily stems from the data used to train LLMs, whereas hallucination, beyond insufficient or erroneous data, can arise from a variety of additional factors, including generation strategies and fine-tuning methods.

It is evident that data plays a pivotal role in the realm of LLMs. Bias may manifest during various stages, including data sampling, text recognition, or data filtering and cleansing. In the pre-training phase, substantial knowledge is assimilated by LLMs from extensive training data, subsequently encoded within their model parameters. Consequently, when confronted with inquiries or tasks, LLMs may exhibit instances of hallucination if they lack pertinent knowledge or have internalized erroneous information from the training corpus.

The selection and filtration of textual data assume paramount significance in addressing the aforementioned dual challenges. Regarding bias-related concerns, the choice of textual content significantly influences the model’s behavior, as specific categories of text may introduce a spectrum of biases, including societal biases (Navigli et al. 2023). Conversely, within the realm of hallucination issues, misleading or inaccurate information might inadvertently be incorporated into the training data, detrimentally affecting the model’s performance. The predominant origins of bias issues in LLMs are inextricably linked to the characteristics of the underlying data. Although contemporary language models undergo training on vast corpora, the documents comprising their training datasets represent but a subset of the available textual material on the World Wide Web. Even supposing one could bear the resource-intensive endeavor of training language models on the entire expanse of the Web, the ultimate systems thus created would still exhibit manifestations of bias.

For instance, it is noteworthy that a substantial proportion of presently prevalent pre-trained models employ Wikipedia as a primary training dataset. While Wikipedia is generally esteemed within the NLP research community as a repository of high-quality information, its articles on geographical, sporting, musical, cinematic, and political topics greatly outnumber contributions related to literature, economics, and history. This unequal distribution encourages models to acquire knowledge that disproportionately reflects the overrepresented domains, potentially leading to unintended outcomes such as the manifestation of gender biases within the model. In addition, the training data for large models may inadvertently incorporate inaccurate or unfaithful information. For example, corpora age over time, and re-annotating them carefully requires substantial resources and proficient annotators; researchers therefore frequently choose to leverage existing datasets instead of engaging in the resource-intensive task of re-annotation (Izsak et al. 2021). Retraining language models likewise demands not only substantial temporal and financial investment but also proficient annotators. In light of these considerations, training models on data with such imperfections may embed erroneous beliefs within the model, leading to the inadvertent generation of misleading information.

5.2 Evaluation methods

As of now, there is no optimal method for either bias evaluation or hallucination evaluation. Observing the taxonomies and specific methods under each category in Sects. 3.2.1 and 4.3.1, methods for bias evaluation appear more mature and systematic than those for hallucination evaluation. A primary reason is the relatively straightforward nature of identifying biases: gender bias typically correlates with gendered words, and political bias often aligns with country names, so most instances involve judging bias between words. In contrast, hallucination evaluation tends to be more complex. Intrinsic hallucination evaluation is somewhat manageable, as answers can usually be found in the context. However, evaluating extrinsic hallucination remains challenging, especially in open-domain scenarios where even humans may struggle to identify hallucination.

Furthermore, comparing the content of both sections makes clear that numerous statistical-based and model-based methods in hallucination evaluation would be categorized as generation-based under the bias evaluation taxonomy, as is evident in studies such as Yu et al. (2023) and Cheng et al. (2023). The taxonomies used for categorizing bias and hallucination evaluation methods are therefore not consistent. The taxonomy for bias evaluation is more specific, with most methods being traditional and not deviating significantly in their approach. In contrast, the classification for hallucination evaluation is broader, as many existing methods are heuristic and draw inspiration from other fields.

5.3 Mitigation methods

Following the discussions in Sects. 3.3 and 4.4, approaches to addressing bias and hallucination can be categorized into three main classes: pre-processing of data, in-processing during training, and post-processing during inference.

Represented by CDA (Zmigrod et al. 2019; Zhao et al. 2018; Webster et al. 2020; Barikeri et al. 2021), data optimization for addressing bias primarily focuses on balancing datasets. Various methods are employed to reduce biased information within the dataset, preventing the model from learning excessive biased information. Similarly, to tackle hallucination issues, the data preprocessing phase mainly involves integrating methods such as data cleaning, data augmentation, and thoughtful dataset development to enhance data quality. This ensures the accuracy and relevance of the dataset, thereby minimizing the occurrence of hallucination.

In contrast to the data processing stage, optimization in the model training phase exhibits significant differences between debiasing and dehallucinating strategies. Within the in-processing phase of training, debiasing methods, such as incorporating regularization terms, constraining model outputs, and introducing supplementary loss functions, primarily aim to prevent the model from acquiring and amplifying the biases inherent in the dataset. Post-processing debiasing methods can be categorized into projection-based, tuning-based, and probing-based approaches. The majority of these techniques predate the widespread adoption of LLMs, underscoring the longstanding concern about bias in machine learning models. Hallucination, conversely, primarily stems from inherent deficiencies within LLMs themselves. Since many advanced LLMs are proprietary and inaccessible for scrutiny, prevailing methods for mitigating hallucination predominantly concentrate on optimization that does not touch the model parameters.

6 Challenges and future directions

In this section, we briefly review some of the challenges encountered by LLMs in addressing hallucination and bias issues, as well as future development trends, as a reference for future researchers.

6.1 Bias in LLMs

Even though there have been various research efforts aimed at addressing bias in LLMs, this field still faces numerous challenges and future opportunities.

Side effects While some debiasing techniques have shown remarkable effectiveness in mitigating specific biases such as gender, race, and religion, concerns about their potential side effects on language modeling capabilities have arisen. Studies have shown that certain debiasing techniques can affect model performance (Meade et al. 2022). Additionally, the inherent noise and limitations of bias benchmarks have made it challenging to evaluate the effectiveness of these techniques. There is still a lack of well-developed research explaining how these debiasing methods affect model performance. Evaluating and explaining how existing debiasing methods affect models could be a promising direction.

Understanding and measuring bias Most of the existing bias evaluation metrics have been for specific categories of bias, such as gender bias (Czarnowska et al. 2021). Given the content generated by current open-domain LLMs, it is not sufficient to perform bias analysis on a single category. A more comprehensive bias evaluation method is needed.

Multilingual and multi-cultural background Current research mainly focuses on English language models. Expanding these techniques to other languages and cross-cultural scenarios represents a significant direction for ongoing efforts (Joshi et al. 2020), which in turn brings many challenges to the bias problem. It is crucial to acknowledge that bias is a multidimensional issue encompassing complex social and cultural factors. For instance, the understanding of what constitutes bias can vary across cultural backgrounds. In a multicultural country, various religious and cultural groups hold different views and taboos regarding food, clothing, customs, and so on. Consider India, a country where diverse cultures and religious beliefs coexist, including but not limited to Hindus, Muslims, Christians, and Sikhs: Hindus might consider the consumption of beef taboo, Muslims typically avoid pork, and Christians may not adhere to such dietary constraints. This cultural diversity becomes even more pronounced when moving beyond a single nation to multiple countries. Such differences between societies and cultures need to be fully considered and respected when designing LLMs. Consequently, future efforts should address multilingual and multi-cultural backgrounds in bias mitigation technology. Additionally, building upon open-source large-scale pre-trained foundation models, quickly and effectively adapting a model to different socio-cultural backgrounds remains a challenge.

Use bias wisely In fact, no dataset is completely free of bias (Linzen 2020). Previous studies have indicated that common methods for removing gender bias in word embedding models are relatively superficial and often conceal bias rather than eliminate it (Gonen and Goldberg 2019). Prost et al. (2019) further demonstrate that traditional debiasing techniques might actually exacerbate bias in downstream classifiers by providing a clearer channel for transmitting gender information. Gardner et al. (2021) show that models are sensitive to very fine-grained biases, which are difficult to detect and filter. Meanwhile, other studies have shown that training on bias-filtered datasets does not necessarily lead to better generalization (Parrish et al. 2021). Recent research also suggests that deliberately amplifying dataset biases in the training set can promote the model's robustness to subtle biases (Reif and Schwartz 2023). How to exploit these hard-to-eliminate dataset biases so that models learn to debias is a promising research direction.

Addressing these issues will be no small task for the research community, since biases ultimately come from human beings; it is important to recognize that biases exist in our own society. Biases can prove valuable in specific situations or settings, provided that users understand their constraints and account for these limitations in their decision-making. Occasionally, the biases inherent in these models may actually reflect the real-world conditions in which they are applied, offering insights into significant social disparities that warrant attention at their roots. Responsible utilization of biased AI models hinges on ensuring that users possess a clear understanding of the potential biases and limitations of these models, empowering them to make well-informed decisions about when and how to employ the models in different contexts. By acknowledging that biased models can be beneficial in specific scenarios and implementing measures to ensure that users recognize and can address their limitations, we can advocate for the responsible use of AI technologies, harnessing the advantages of AI while minimizing the associated bias-related risks.

6.2 Hallucination in LLMs

Like the issue of bias, hallucination in LLMs still involves many difficulties and challenges, which are reflected in the following aspects.

Evaluating hallucination The most reliable way to assess hallucination is human evaluation, although there has been much research into making automated evaluation more accurate and effective (Lin et al. 2021; Min et al. 2023; Zha et al. 2023; Mündler et al. 2023). However, there are still many discrepancies between current automated evaluation and human evaluation (Lin et al. 2021; Muhlgay et al. 2023; Min et al. 2023). Moreover, the reliability of automated evaluation fluctuates greatly for text generated by different LLMs, or by the same LLM in different domains (Min et al. 2023). These problems have yet to be resolved.

Model editing Hallucination in LLMs mainly stems from the memorization of incorrect information or the absence of correct factual knowledge. Model editing (Sinitsin et al. 2020; De Cao et al. 2021) aims to address these problems by modifying the behavior of the model in a data- and compute-efficient manner. Currently, there are two mainstream paradigms for model editing: introducing auxiliary subnetworks (Mitchell et al. 2022; Huang et al. 2023) or directly modifying the original model parameters (Meng et al. 2022). This technique may help eliminate hallucination in LLMs by editing the stored factual knowledge. However, this emerging field still faces many challenges, including editing black-box LLMs, in-context model editing (Zheng et al. 2023), and multi-hop model editing (Zhong et al. 2023).

Problems of RLHF Human feedback has become the de facto standard for evaluating the performance of LLMs and is increasingly used as a training objective. However, recent studies have shown that human annotations are not fully reliable as evaluation metrics or training targets, and that using human feedback as a training target disproportionately increases the confidence of model outputs (Hosking et al. 2023); this confidence is often what drives the model to hallucinate. With human involvement, annotators tend to look for shortcuts to make the task easier, so they are more likely to base their judgments on surface attributes such as fluency and linguistic complexity rather than expending effort on detecting truthfulness. Testing ChatGPT revealed a preference for the verbose and “chatty” style of responses it generates (Kabir et al. 2023), and LLMs trained with RLHF tend to be sycophantic (Perez et al. 2022).

Multilingual and multi-cultural background LLMs may perform poorly in contexts other than English (Ahuja et al. 2023; Lai et al. 2023), and some low-resource languages suffer from hallucination problems (Guerreiro et al. 2023); addressing these multilingual problems is one potential research direction. LLMs have also been applied to a variety of multimodal tasks in complex scenes, and studies show that hallucination is unavoidable in these multimodal settings (Li et al. 2023; Liu et al. 2023; Wu et al. 2023; Su et al. 2023; Maaz et al. 2023). Addressing the hallucinations that arise with images, video, audio, and other modalities would therefore be an interesting direction.

Addressing the problems mentioned above could be a future direction for the hallucination problem in LLMs. Recent research also suggests that prompts characterized by greater formality and concreteness tend to result in reduced hallucination (Rawte et al. 2023). Users therefore need more guidance on how to prompt LLMs in ways that reduce hallucination.

6.3 Other problems

We primarily summarize the issues of bias and hallucination in LLMs. However, it is worth noting that there are other concerns related to the trustworthiness of LLMs. These concerns are briefly discussed in this section, offering references for future research.

  • Data privacy security Recent significant advancements in LLMs are largely due to the extensive amount of training data crawled from the internet. This data, sourced from websites, social media platforms, and other public text sources, may include personal information such as names, ages, genders, occupations, hobbies, and social connections. There is a risk that LLMs could unintentionally learn and memorize this information, potentially leaking sensitive personal data in their outputs (Huang et al. 2022). Currently, there are no guaranteed safeguards against the accidental leakage of Personally Identifiable Information (PII), and the probability and mechanisms of PII leakage, especially under specific prompting conditions, remain poorly understood. In 2021, Carlini and colleagues at Google proposed methods to extract training data from GPT-2, demonstrating that LLMs may reveal some users' real identities or private information when generating text (Carlini et al. 2021). To protect user data security, developers must implement robust measures to safeguard the privacy of the data employed in training these models.

  • Copyright violations Copyright infringement is another significant challenge in content generation by LLMs. These models may retain not only the knowledge present in the training data but also entire text segments observed during training (Karamolegkou et al. 2023). Copyright laws protect original materials from unauthorized use, and LLMs risk infringing these protections by potentially reproducing copyrighted texts, introducing complex infringement concerns.

  • Jailbreak attacks In earlier versions of ChatGPT, jailbreak attacks could easily manipulate the model into eliciting undesired behavior. Although advanced LLMs such as GPT-4 have acquired a decent ability to respond appropriately to factuality-related queries, there are still well-designed jailbreak prompts that bypass the safety measures of LLMs and produce undesirable content (Wei et al. 2023; Zou et al. 2023). Such content may violate local laws or be used for illegal activities, in which case the misuse of LLMs has serious consequences.

We aspire for future research to tackle these challenges, paving the way for LLMs to be truly safe and benign. By resolving these issues, we can harness the full potential of LLMs, utilizing them more effectively and responsibly.

7 Conclusion

Today’s LLMs are widely applied across various domains, yet they inevitably face issues of bias and hallucination. Especially for popular generative models, ensuring that their outputs are responsible is crucial. This survey presents a comprehensive study focused on debiasing and dehallucinating in LLM audits.

Beginning with definitions, this paper thoroughly explains and categorizes both bias and hallucination, highlighting that biases often manifest in specific forms such as gender or racial biases, while hallucinations are typically divided into intrinsic and extrinsic for detailed study. A taxonomy of evaluation metrics and methods for both bias and hallucination is presented. For bias evaluation, the taxonomy classifies methods based on their strategies in using the model under evaluation. For hallucination evaluation, the taxonomy classifies methods based on the dependencies of the evaluation methods. Additionally, this paper summarizes and presents widely used and newly published evaluation benchmarks for these issues. The paper then explores methods for debiasing and dehallucinating, again providing a taxonomy. This taxonomy classifies methods based on their intervention stages during mitigation.

This paper also compares these two significant issues in LLMs, analyzing the contributors to their emergence and contrasting their evaluation and mitigation methods. As mentioned in the last section, there are still many challenges in this field. Consequently, this paper concludes by suggesting future research directions based on the current challenges and emerging research trends. We hope that this work provides support for both existing and future research endeavors in this field.