Controlling hallucinations at word level in data-to-text generation

Rebuffel, Clement; Roberti, Marco; Soulier, Laure; Scoutheeten, Geoffrey; Cancelliere, Rossella; Gallinari, Patrick

doi:10.1007/s10618-021-00801-4

Controlling hallucinations at word level in data-to-text generation

Open access
Published: 22 October 2021

Volume 36, pages 318–354, (2022)
Cite this article

Download PDF

You have full access to this open access article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Controlling hallucinations at word level in data-to-text generation

Download PDF

Clement Rebuffel¹^na1,
Marco Roberti ORCID: orcid.org/0000-0003-1430-7006²^na1,
Laure Soulier¹,
Geoffrey Scoutheeten³,
Rossella Cancelliere² &
…
Patrick Gallinari^1,4

6284 Accesses
12 Citations
7 Altmetric
Explore all metrics

Abstract

Data-to-Text Generation (DTG) is a subfield of Natural Language Generation aiming at transcribing structured data in natural language descriptions. The field has been recently boosted by the use of neural-based generators which exhibit on one side great syntactic skills without the need of hand-crafted pipelines; on the other side, the quality of the generated text reflects the quality of the training data, which in realistic settings only offer imperfectly aligned structure-text pairs. Consequently, state-of-art neural models include misleading statements –usually called hallucinations—in their outputs. The control of this phenomenon is today a major challenge for DTG, and is the problem addressed in the paper. Previous work deal with this issue at the instance level: using an alignment score for each table-reference pair. In contrast, we propose a finer-grained approach, arguing that hallucinations should rather be treated at the word level. Specifically, we propose a Multi-Branch Decoder which is able to leverage word-level labels to learn the relevant parts of each training instance. These labels are obtained following a simple and efficient scoring procedure based on co-occurrence analysis and dependency parsing. Extensive evaluations, via automated metrics and human judgment on the standard WikiBio benchmark, show the accuracy of our alignment labels and the effectiveness of the proposed Multi-Branch Decoder. Our model is able to reduce and control hallucinations, while keeping fluency and coherence in generated texts. Further experiments on a degraded version of ToTTo show that our model could be successfully used on very noisy settings.

ChatGPT is bullshit

Article Open access 08 June 2024

Natural language processing: state of the art, current trends and challenges

Article 14 July 2022

A survey on large language model based autonomous agents

Article Open access 22 March 2024

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Data-to-Text Generation (DTG) is the subfield of Computational Linguistics and Natural Language Generation (NLG) that is concerned with transcribing structured data into natural language descriptions, or, said otherwise, transcribing machine understandable information into a human understandable description (Gatt and Krahmer 2018). DTG objectives includes coverage, i.e. all the required information should be present in the text, and adequacy, i.e. the text should not contain information that is not covered by the input data. DTG is a domain distinct from other NLG task (e.g. machine translation (Wiseman et al. 2017), text summarization (Kryscinski et al. 2019)) with its own challenges (Wiseman et al. 2017), starting with the nature of inputs (Reiter and Dale 1997; Narayan and Gardent 2020). Such inputs include and are not limited to: databases of records, spreadsheets, knowledge bases, sensor readings. As an example, Fig. 1 shows an instance of the WikiBio dataset, i.e. a data table containing information about Kian Emadi, paired with its corresponding natural language description found on Wikipedia.

Early approaches to DTG relied on static rules hand-crafted by experts, in which content selection (what to say) and surface realization (how to say it) are typically two separate tasks (Reiter and Dale 1997; Ferreira et al. 2019). In recent years, neural models have blurred this distinction: various approaches showed that both content selection and surface realization can be learned in an end-to-end, data-driven fashion (Mei et al. 2016; Liu et al. 2019a; Roberti et al. 2019; Puduppully et al. 2019a). Based on the now-standard encoder-decoder architecture, with attention and copy mechanisms (Bahdanau et al. 2015; See et al. 2017), neural methods for DTG are able to produce fluent text conditioned on structured data in a number of domains (Lebret et al. 2016; Wiseman et al. 2017; Puduppully et al. 2019b), without relying on heavy manual work from field experts.

Such advances have gone hand in hand with the introduction of larger and more complex benchmarks. In particular, surface-realization abilities have been well studied on hand-crafted datasets such as E2E (Novikova et al. 2017b) and WebNLG (Gardent et al. 2017), while content-selection has been addressed by automatically constructed dataset such as WikiBio (Lebret et al. 2016) or RotoWire (Wiseman et al. 2017). These large corpora are often constructed from internet sources, which, while easy to access and aggregate, do not consist of perfectly aligned source-target pairs (Perez-Beltrachini and Gardent 2017; Dhingra et al. 2019). Consequently, model outputs are often subject to over-generation: misaligned fragments from training instances, namely divergences, can induce similarly misaligned outputs during inference, the so-called hallucinations.

In this paper, we specifically address the issue of hallucinations, which is currently regarded as a major issue in DTG (Narayan and Gardent 2020). Indeed, experimental surveys show that real-life end-users of DTG systems care more about reliability than about readability (Reiter and Belz 2009), as unfaithful texts can potentially mislead decision makers, with dire consequences. Hallucinations-reduction methods such as the one presented here have applications in a broad range of tasks requiring high reliability, like news reports (Leppänen et al. 2017), in which hallucinations may give rise to fake news, or summaries of patient information in clinical contexts (Portet et al. 2009; Banaee et al. 2013).

When corpora include a mild amount of noise, as in handcrafted ones (e.g. E2E, WebNLG), dataset regularization techniques (Nie et al. 2019; Dusek et al. 2019) or hand crafted rules (Juraska et al. 2018) can help to reduce hallucinations. Unfortunately, these techniques are not suited to more realistic and noisier datasets, as for instance WikiBio (Lebret et al. 2016) or RotoWire (Wiseman et al. 2017). On these benchmarks, several techniques have been proposed, such as reconstruction loss terms (Wiseman et al. 2017; Wang 2019; Lin et al. 2020) or Reinforcement Learning (RL) based methods (Perez-Beltrachini and Lapata 2018; Liu et al. 2019b; Rebuffel et al. 2020). These approaches suffer however from different issues: (1) the reconstruction loss relies on the hypothesis of one-to-one alignment between source and target which does not fit with content selection in DTG; (2) RL-trained models are based on instance-level rewards (e.g. BLEU (Papineni et al. 2002), PARENT (Dhingra et al. 2019)) which can lead to a loss of signal because divergences occur at the word level. In practice, parts of the target sentence express source attributes (in Fig. 1 name and occupation fields are correctly realized), while others diverge (the birthday and nationality of Kian Emadi are not supported by the source table).

Interestingly, one can view DTG models as Controlled Text Generation (CTG) ones focused on controlling content, as most CTG techniques condition the generation on several key-value pairs of control factors (e.g. tone, tense, length) (Dong et al. 2017; Hu et al. 2017; Ficler and Goldberg 2017). Recently, Filippova (2020) explicitly introduced CTG to DTG by leveraging an hallucination score simply attached as an additional attribute which reflects the amount of noise in the instance. As an example, the table from Fig. 1 can be augmented with an additional line (hallucination_score, 80%)^{Footnote 1}. However, this approach requires a strict alignment at the instance-level, namely between control factors and target text. A first attempt towards word-level approaches is proposed by Perez-Beltrachini and Lapata (2018) (also PB&L in the following). They design word-level alignment labels, denoting the correspondence between the text and the input table, to bootstrap DTG systems. However, they incorporate these labels into a sentence-level RL-reward, which ultimately leads to a loss of this finer-grained signal.

In this paper, we go further in this direction with a DTG model by fully leveraging word-level alignment labels with a CTG perspective. We propose an original approach in which the word-level is integrated at all phases:

we propose a word-level labeling procedure (Sect. 3), based on co-occurrences and sentence structure through dependency parsing. This mitigates the failure of strict word-matching procedure, while still producing relevant labels in complex settings.
we introduce a weighted multi-branch neural decoder(Sect. 4), guided by the proposed alignment labels, acting as word-level control factors. During training, the model is able to distinguish between aligned and unaligned words and learns to generate accurate descriptions without being misled by un-factual reference information. Furthermore, our multi-branch weighting approach enables control at inference time.

We carry out extensive experiments on WikiBio, to evaluate both our labeling procedure and our decoder (Sect. 6). We also test our framework on ToTTo (Parikh et al. 2020), in which models are trained with noisy reference texts, and evaluated on references reviewed and cleaned by human annotators to ensure accuracy. Evaluations are based on a range of automated metrics as well as human judgments, and show increased performances regarding hallucinations reduction, while preserving fluency.

Importantly, our approach makes training neural models on noisy datasets possible, without the need to handcraft instances. This work shows the benefit of word-level techniques, which leverage the entire training set, instead of removing problematic training samples, which may form the great majority of the available data.

2 Related work

Handling hallucinations in noisy datasets The use of Deep Learning based methods to solve DTG tasks has led to sudden improvements in state of the art performances (Lebret et al. 2016; Wiseman et al. 2017; Liu et al. 2018; Puduppully et al. 2019a). As a key aspect in determining a model’s performance is the quality of training data, several large corpora have been introduced to train and evaluate models’ abilities on diverse tasks. E2E (Novikova et al. 2017b) evaluates surface realization, i.e. the strict transcription of input attributes into natural language; RotoWire (Wiseman et al. 2017) pairs statistics of basketball games with their journalistic descriptions, while WikiBio (Lebret et al. 2016) maps a Wikipedia info-box with the first paragraph of its associated article. Contrary to E2E, the latter datasets are not limited to surface realization. They were not constructed by human annotators, but rather created from Internet sources, and consist of loosely aligned table-reference pairs: in WikiBio, almost two thirds of the training instances contain divergences (Dhingra et al. 2019), and no instance has a 1-to-1 source-target alignment (Perez-Beltrachini and Gardent 2017).

On datasets with a moderate amount of noise, such as E2E, data pre-processing has proven effective for reducing hallucinations. Indeed, rule-based (Dusek et al. 2019) or neural-based methods (Nie et al. 2019) have been proposed, specifically with table regularization techniques, where attributes are added or removed to re-align table and target description. Several successful attempts have also been made in automatically learning alignments between the source tables and reference texts, benefiting from the regularity of the examples (Juraska et al. 2018; Shen et al. 2020; Gehrmann et al. 2018). For instance, Juraska et al. (2018) leverage templating and hand-crafted rules to re-rank the top outputs of a model decoding via beam search; Gehrmann et al. (2018) also leverage the possible templating formats of E2E’s reference texts, and train an ensemble of decoders where each decoder is associated to one template; and Kasner and Dusek (2020) produce template-based lexicalizations and improve them via a sentence fusion model. The previous techniques are not applicable in more complex, general settings. The work of Dusek et al. (2019) hints at this direction, as authors found that neural models trained on E2E were principally prone to omissions rather than hallucinations. In this direction, Shen et al. (2020) were able to obtain good results at increasing the coverage of neural outputs, by constraining the decoder to focus its attention exclusively on each table cell sequentially until the whole table was realized. On more complex datasets (e.g. WikiBio), a wide range of methods has been explored to deal with factualness such as loss design, either with a reconstruction term (Wiseman et al. 2017; Wang 2019) or with RL-based methods (Perez-Beltrachini and Lapata 2018; Liu et al. 2019b; Rebuffel et al. 2020). Similarly to the coverage constraints, a reconstruction loss has proven only marginally efficient in these settings, as it contradicts the content selection task (Wang 2019), and needs to be well calibrated using expert insight in order to bring improvements. Regarding RL, Perez-Beltrachini and Lapata (2018) build an instance-level reward which sums up word-level scores; Liu et al. (2019b) propose a reward based on document frequency to favor words from the source table more than rare words; and Rebuffel et al. (2020) train a network with a variant of PARENT (Dhingra et al. 2019) using self-critical RL. Note that data regularization techniques have also been proposed (Thomson et al. 2020; Wang 2019), but these methods require heavy manual work and expert insights, and are not readily transposable from one domain to another.

From CTG to controlling hallucinations Controlled Text Generation (CTG) is concerned with constraining a language model’s output during inference on a number of desired attributes, or control factors, such as the identity of the speaker in a dialog setting (Li et al. 2016), the politeness of the generated text or the text length in machine-translation (Sennrich et al. 2016; Kikuchi et al. 2016), or the tense in generated movie reviews (Hu et al. 2017). Earlier attempts at neural CTG can even be seen as direct instances of DTG as it is currently defined: models are trained to generate text conditioned on attributes of interest, where attributes are key-value pairs. For instance, in the movie review domain, Ficler and Goldberg (2017) proposed an expertly crafted dataset, where sentences are strictly aligned with control factors, being either content or linguistic style aspects (e.g. tone, length).

In the context of dealing with hallucinations in DTG, Filippova (2020) recently proposed a similar framework, by augmenting source tables with an additional attribute that reflects the degree of hallucinated content in the associated target description. During inference, this attribute acts as an hallucination handle used to produce more or less factual text. As mentioned in Sect. 1, we argue that a unique value can not accurately represent the correspondence between a table and its description, due to the phrase-based nature of divergences.

Based on the literature review, the lack of model control can be evidenced when loss modification methods are used (Wang 2019; Liu et al. 2019a; Rebuffel et al. 2020), although these approaches can be efficient and transposed from one domain to another. On the other hand, while CTG deals with control and enables choosing the defining features of generated texts (Filippova 2020), standard approaches rely on instance-level control factors that do not fit with hallucinations, which rather appear due to divergences at the word level. Our approach aims at gathering the merits of both trends of models and is guided by previous statements highlighting that word-level is primary in hallucination control. More particularly, our model differs from previous ones in several aspects:

(1)
Contrasting with data-driven approaches (i.e. dataset regularization) which are costly in expert time, and loss-driven approaches (i.e. reconstruction or RL losses) which often do not take into account key subtasks of DTG (content-selection, world-level correspondences), we propose a multi-branch modeling procedure which allows the controllability of the hallucination factor in DTG. This multi-branch model can be integrated seamlessly in current approaches, allowing to keep peculiarities of existing DTG models, while deferring hallucination management to a parallel decoding branch.
(2)
Unlike previous CTG approaches (Li et al. 2016; Sennrich et al. 2016; Ficler and Goldberg 2017; Filippova 2020) which propose instance-level control factors, the control of the hallucination factor is performed at the word-level to enable finer-grained signal to be sent to the model.

Our model is composed of two main components: (1) a word-level alignment labeling mechanism, which makes the correspondence between the input table and the text explicit, and (2) a multi-branch decoder guided by these alignment labels. The branches separately integrate co-dependent control factors (namely content, hallucination and fluency). We describe these components in Sects. 3 and 4, respectively.

3 Word-level alignment labels

We consider a DTG task, in which the corpus $\mathcal {C}$ is composed of a set of entity-description pairs, (e, y). A single-entity table e is a variable-sized set of $T_e$ key-value pairs $x {:}{=}(k, v)$. A description $y {:}{=}y_{1:T_y}$ is a sequence of $T_y$ tokens representing the natural language description of the entity; we refer to the tokens spanning from indices t to $t'$ of a description y as $y_{t:t'}$. A description is made of statements, defined as text spans expressing one single idea (“Appendix A” presents in detail the statement partitioning procedure). We refer to the first index of a statement as $t_i$, so that $y_{t_i:t_{i+1}-1}$ is the $i^{th}$ statement itself. Figure 1 shows a WikiBio entity made by 8 key-value pairs together with its associated description.

First, we aim at labeling each word from a description, depending on the presence of a correspondence with its associated table. We call such labels alignment labels. We drive the word-level labeling procedure on two intuitive constraints: (1) important words (names, adjectives and numbers) should be labeled depending on their alignment with the table, and (2) words from the same statement should have the same label.

With this in mind, the alignment label for the $t^{\text {th}}$ token $y_t$ is a binary label: $l_t {:}{=}\mathbbm {1}_{\{s_t > \tau \}}$ where $s_t$ refers to the alignment score between $y_t$ and the table, and $\tau $ is set experimentally (see Sect. 5.3). The alignment score $s_t$ acts as a normalized measure of correspondence between a token $y_t$ and the table e:

$$\begin{aligned} s_t \,{:}{=}\, norm (\max _{x \in e} align (y_t, x), ~~y) \end{aligned}$$

(1)

where the function align estimates the alignment between token $y_t$ and a key-value pair x from the input table e, and norm is a normalization function based on the dependency structure of the description y. Figure 2 illustrates our approach: under each word we show its word alignment score, and words are colored in red if this score is lower than $\tau $, denoting an alignment label equal to 0. Below, we describe these functions (“Appendix A” contains reproducibility details).

Co-occurrence-based alignment function ($\mathbf {align}(\cdot , \mathbf{x})$). This function assigns to important words a score in the interval [0, 1] proportional to their co-occurrence count (a proxy for alignment) with the key-value pair from the input table. If the word $y_t$ appears in the key-value pair $x {:}{=}(k,v)$, $align(y_t, x)$ outputs 1; otherwise, the output is obtained scaling the number of occurrences $co_{y_t,x}$ between $y_t$ and x through the dataset:

$$\begin{aligned} align(y_t,x) {:}{=}{\left\{ \begin{array}{ll} 1 &{} \text {if~~} y_t \in x\\ a \cdot (co_{y_t,x}\!- m)^2&{} \text {if~~} m \le co_{y_t,x}\!\le M\\ 0&{} \text {if~~} 0 \le co_{y_t,x}\!\le m \end{array}\right. } \end{aligned}$$

(2)

where M is the maximum number of word co-occurrences in the dataset vocabulary and the row x, m is a threshold value, and $a {:}{=}\frac{1}{(M-m)^2}$.

Score normalization ($\mathbf {norm(\cdot , y)}$). According to the already stated assumption (2)—words inside the same statement should have the same score – , we first split the sentence y into statements $y_{t_i:t_{i+1}-1}$, via dependency parsing and its rule-based conversion to constituency trees (Han et al. 2000; Xia and Palmer 2001; Hwa et al. 2005; Borensztajn et al. 2009). Given a word $y_t$ associated to the score $s_t$ and belonging to statement $y_{t_i:t_{i+1}-1}$, its normalized score corresponds to the average score of all important words in this statement:

$$\begin{aligned} norm(s_t, y) = \frac{1}{t_{i+1}-t_i} \sum _{j=t_i}^{t_{i+1}-1} s_j \end{aligned}$$

(3)

This in-statement average depends on both the specific word and its context, leading to coherent hallucination scores which can be thresholded without affecting the syntactical sentence structure, as shown in Fig. 2.

4 Multi-branch architecture

The proposed Multi-Branch Decoder (MBD) architecture aims at separating targeted co-dependent factors during generation. We build upon the standard DTG architecture, an encoder-decoder with attention and copy mechanism, which we modify by duplicating the decoder module into three distinct parallel modules. Each control factor (i.e. content, hallucination or fluency) is modeled via a single decoding module, also called branch, whose output representation can be weighted according to its desired importance. At training time, weights change depending on the word currently being decoded, inducing the desired specialization of each branch. During inference, weights are manually set, according to the desired trade-off between information reliability, sentence diversity and global fluency. Text generation is thus controllable, and consistent with the control factors.

Figure 3 illustrates a training step over the sentence “Giuseppe Mariani was an Italian art director”, in which Italian is a divergent statement (i.e. is not supported by the source table). While decoding factual words, the weight associated to the content (resp. hallucination) branch is set to 0.5 (resp. 0) while during the decoding of Italian, the weight associated to the content (resp. hallucination) branch is set to 0 (resp. 0.5). Note that the weight associated to the fluency branch is always set to 0.5, as fluency does not depend on factualness.

The decoding modules’ actual architecture may vary, as we framed the MBD model from a high level perspective. Therefore, all types of decoder can be used, such as Recurrent Neural Networks (RNNs) (Rumelhart et al. 1986), Transformers (Vaswani et al. 2017), and Convolutional Neural Networks (Gehring et al. 2017). The framework can be generalized to different merging strategies as well, such as late fusion, in which the final distributions are merged, instead of the presented early fusion, which works at the decoder states level.

In this paper, experiments are carried out on RNN-based decoders, weighting their hidden states. Sect. 4.1 presents the standard DTG encoder-decoder architecture; Sect. 4.2 shows how it can be extended to MBD, together with its peculiarities and the underlying objectives and assumptions.

4.1 Standard DTG architecture

Neural DTG approaches typically use an encoder-decoder architecture (Wiseman et al. 2017) in which (1) the encoder relies on a RNN to encode each element of the source table into a fixed-size latent representation $h_j$ (elements of the input table are first embedded into $T_e$ N-dimensional vectors, and then fed sequentially to the RNN (Wiseman et al. 2017)), and (2) the decoder generates a textual description y using a RNN augmented with attention and copy mechanisms (See et al. 2017). Words are generated in an auto-regressive way. The decoder’s RNN updates its hidden state $d_t$ as:

$$\begin{aligned} d_t \,{:}{=}\, \text {RNN}(d_{t-1}, [y_{t-1}, c_t]) \end{aligned}$$

(4)

where $y_{t-1}$ is the previous word and $c_t$ is the context vector obtained through the attention mechanism. Finally, a word is drawn from the distribution computed via a copy mechanism (See et al. 2017).

4.2 Controlling hallucinations via a multi-branch model

Our objective is to enrich the decoder in order to be able to tune the content/hallucination ratio during generation, aiming at enabling generation of hallucination-free text when needed. Our key assumption is that the decoder’s generation is conditioned by three co-dependent factors:

Content factor constrains the generation to realize only the information included in the input;
Hallucinating factor favors lexically richer and more diverse text, but may lead to hallucinations not grounded by the input;
Fluency factor^{Footnote 2} conditions the generated sentences toward global syntactic correctness, regardless of the relevance.

Based on this assumption, we propose a multi-branch encoder-decoder network, whose branches are constrained on the above factors at word-level, as illustrated in Fig. 3. Our network has a single encoder and $F=3$ distinct decoding RNNs, noted $\text {RNN}^f$ respectively, one for each factor. During each decoding step, the previously decoded word $y_{t-1}$ is fed to all RNNs, and a final decoder state $d_t$ is computed using a weighted sum of all the corresponding hidden states,

$$\begin{aligned} d_t^f&{:}{=}&\,\text {RNN}^f(d_{t-1}^f, [y_{t-1}, c_t]) \end{aligned}$$

(5)

$$\begin{aligned} d_t&{:}{=}&\, \sum _{f=1}^F \omega _t^f d^f_t \end{aligned}$$

(6)

where $d_t^f$ and $\omega _t^f$ are respectively the hidden state and the weight of the $f^{th}$ RNN at time t.

Weights are used to constrain the decoder branches to the desired control factors ($\omega _t^0,\omega _t^1,\omega _t^2$ for the content, hallucination and fluency factors respectively) and sum to one.

During training, their values are dynamically set depending on the alignment label $l_t\in \{0,1\}$ of the target token $y_t$ (see Sect. 5.3). While a number of mappings can be used to set the weights given the alignment label, early experiments have shown that better results were achieved when using a binary switch for each factor, i.e. activating/deactivating each branch, as shown in Fig. 3 (note that fluency should not depend on content and therefore its associated branch is always active).

During inference, the weights of the decoder’s branches are set manually by a user, according to the desired trade-off between information reliability, sentence diversity and global fluency. Text generation is then controllable and consistent with the control factors.

5 Experimental setup

5.1 Datasets

We evaluated the model on two representative large size datasets. Both have been collected automatically and present a significant amount of table-text divergences for training. Both datasets involve content selection and surface realization, and represent a relatively realistic setting.

WikiBio (Lebret et al. 2016) contains 728, 321 tables, automatically paired with the first sentence of the corresponding Wikipedia English article. Reference text’s average length is 26 words, and tables have on average 12 key-value pairs. We use the original data partition: $80\%$ for the train set, and $10\%$ for validation and test sets. This dataset has been automatically built from the Internet; concerning divergences, $62\%$ of the references mention extra information not grounded by the table (Dhingra et al. 2019).

ToTTo (Parikh et al. 2020) contains 120, 761 training examples, and 7, 700 validation and test examples. For a given Wikipedia page, an example is built up by pairing its summary table and a candidate sentence, selected across the whole page via simple similarity heuristics. Such a sentence may accordingly realize whichever table cells, making content selection arbitrary; furthermore, its lexical form may strongly depend on the original context, because of pronouns or anaphoras. Divergences are of course present as well. Those issues have been addressed by Parikh et al. (2020) by (1) highlighting the input cells realized by the output, and (2) removing divergences and making the sentence self-contained (e.g. replacing pronouns with their invoked noun or noun phrase). Figure 6 exemplifies the difference between noisy and clean ToTTo sentences. In our experiments, we limit the input to the highlighted cells and use the original, noisy sentence as output. Noisy texts’ average length is 17.4 words, and 3.55 table cells are highlighted, on average.

5.2 Baselines

We assess the accuracy and relevance of our alignment labels against the ones proposed by Perez-Beltrachini and Lapata (2018), which is, to the best of our knowledge, the only work proposing such a fine-grained alignment labeling.

To evaluate our Multi-Branch Decoder (MBD), we consider five baselines:

stnd (See et al. 2017), a LSTM-based encoder-decoder model with attention and copy mechanisms. This is the standard sequence-to-sequence recurrent architecture.
stnd_filtered, the previous model trained on a filtered version of the training set: tokens deemed hallucinated according to their hallucination scores, are removed from target sentences.
hsmm (Wiseman et al. 2018), an encoder-decoder model with a multi-branch decoder. The branches are not constrained by explicit control factors. This is used as a baseline to show that the multi-branch architecture by itself does not guarantee the absence of hallucinations.
hier (Liu et al. 2019a), a hierarchical sequence-to-sequence model, with a coarse-to-fine attention mechanism to better fit the attribute-value structure of the tables. This model is trained with three auxiliary tasks to capture more accurate semantic representations of the tables.
${hal_{WO}}$ (Filippova 2020), a stnd-like model trained by augmenting each source table with an additional attribute (hallucination ratio, value).

We ran our own implementations of stnd, stnd_filtered and ${hal_{WO}}$. Authors of hier and hsmm models kindly provided us their WikiBio’s test set outputs. The metrics described in Sect. 5.4 were directly applied on them.

5.3 Implementation details

During training of our multi-branch decoder the fluency branch is always active ($\omega _t^2 = 0.5$) while the content and hallucination branches are alternatively activated, depending on the alignment label $l_t$: $\omega _t^0 = 0.5$ (content factor) and $\omega _t^1 = 0$ (hallucination factor) when $l_t = 1$, and conversely. The threshold $\tau $ used to obtain $l_t$ is set to 0.4 using human tuning to optimize for highest accuracy.^{Footnote 3} All hyperparameters were tuned in order to optimize the validation PARENT F-measure (Dhingra et al. 2019). In particular, we use the [0.4 0.1 0.5] weight combination during inference. See Sect. 6.2 for a discussion about weight combinations and “Appendix B” for other implementation details.^{Footnote 4}

5.4 Metrics

To evaluate our model, we carried out (1) an automatic analysis and (2) a human evaluation for a qualitative analysis of generated sentences.

For the automatic analysis, we use five metrics:

BLEU (Papineni et al. 2002) is a length-penalized precision score over n-grams, $n \in \llbracket 1, 4 \rrbracket $, optionally improved with a smoothing technique (Chen and Cherry 2014). Despite being the standard choice, recent findings show that it correlates poorly with human evaluation, especially on the sentence level (Novikova et al. 2017a; Reiter 2018), and that it is a proxy for sentence grammar and fluency aspects rather than semantics (Dhingra et al. 2019).
PARENT (Dhingra et al. 2019) computes smoothed n-gram precision and recall over both the reference and the input table. It is explicitly designed for DTG tasks, and its F-measure shows “the highest correlation with humans across a range of settings with divergent references in WikiBio.” (Dhingra et al. 2019)
The hallucination rate computes the percentage of tokens labeled as hallucinations (Sect. 3).
The average generated sentence length in number of words.
The classic readability Flesch index (Flesch 1962), which is based on words per sentence and syllables per word, and is still used as a standard benchmark (Kosmajac and Keselj 2019; Smeuninx et al. 2020; Stajner and Hulpus 2020; Stajner et al. 2020).

Finally, we perform qualitative evaluations of the results obtained on WikiBIO and ToTTo, following the best practices outlined by van der Lee et al. (2019). Our human annotators are from several countries across Europe, between 20 and 55 years old and proficient in English. They have been assigned two different tasks: (i) hallucination labeling, i.e. the selection of sentence pieces which include incorrect information, and (ii) sentence analysis, i.e. evaluating different realizations of the same table according to their fluency, factualness and coverage. Scores are presented as a 3-level Likert scale for Fluency (Fluent, Mostly fluent, or Not fluent) and Factualness (likewise), while coverage is the number of cells from the table that have been realized in the description.

To avoid all bias, annotators are shown a randomly selected table at a time, together with its corresponding descriptions, both from the dataset and the models that are being evaluated. Sentences are presented each time in a different order. Following Tian et al. (2019), we first tasked three expert annotators to annotate a pilot batch of 50 sentences. Once confirmed that Inter-Annotator Agreement was approx. 75% (a similar finding to Tian et al. (2019)), we asked 16 annotators to annotate a bigger sample of 300 instances (where each instance consists of one table and four associated outputs), as Liu et al. (2019a).^{Footnote 5}

6 Results

We perform an extensive evaluation of our scoring procedure and multi-branch architecture on the WikiBio dataset: we evaluate—the quality of the proposed alignment labels, both intrinsically using human judgment and extrinsically by means of the DTG downstream task and—the performance of our model with respect to the baselines. Additionally, we assess the applicability of our framework on the more noisy ToTTo benchmark, which represents a harder challenge for today’s DTG models.

Table 1 Performances of hallucination scores on WikiBio test set, w.r.t. human-designated labels (upper table) and MBD trained with different labeling procedures (lower table). Our model always significantly overpasses PB&L (T-test with $p < 0.005$)

Controlling hallucinations at word level in data-to-text generation

Abstract

Similar content being viewed by others

ChatGPT is bullshit

Natural language processing: state of the art, current trends and challenges

A survey on large language model based autonomous agents

1 Introduction

2 Related work

3 Word-level alignment labels

4 Multi-branch architecture

4.1 Standard DTG architecture

4.2 Controlling hallucinations via a multi-branch model

5 Experimental setup

5.1 Datasets

5.2 Baselines

5.3 Implementation details

5.4 Metrics

6 Results

6.1 Validation of alignment labels

6.2 Automatic system evaluation

6.3 Human evaluation

6.4 ToTTo: a considerably noisy setting

7 Conclusion

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Code availability

Additional information

Publisher's Note

Appendices

A Alignment labels reproducibility

B Implementation details

C Annotation interface

D Qualitative examples

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation