Preference-based interactive multi-document summarisation

Interactive NLP is a promising paradigm to close the gap between automatic NLP systems and the human upper bound. Preference-based interactive learning has been successfully applied, but the existing methods require several thousand interaction rounds even in simulations with perfect user feedback. In this paper, we study preference-based interactive summarisation. To reduce the number of interaction rounds, we propose the Active Preference-based ReInforcement Learning (APRIL) framework. APRIL uses active learning to query the user, preference learning to learn a summary ranking function from the preferences, and neural reinforcement learning to efficiently search for the (near-)optimal summary. Our results show that users can easily provide reliable preferences over summaries and that APRIL outperforms the state-of-the-art preference-based interactive method in both simulation and real-user experiments.


Introduction
Interactive Natural Language Processing (NLP) approaches that put the human in the loop have recently gained increasing research interest (Amershi et al. 2014; Gurevych et al. 2018; Kreutzer et al. 2018a). The user-system interaction enables personalised and user-adapted results by incrementally refining the underlying model based on a user's behaviour and by optimising the learning through actively querying for feedback and judgements. Interactive methods can start with little or no input data and adjust the output to the needs of human users.
Previous research has explored eliciting different forms of feedback from users in interactive NLP, for example mouse clicks for information retrieval (Borisov et al. 2018), post-edits and ratings for machine translation (Denkowski et al. 2014; Kreutzer et al. 2018a), error markings for semantic parsing (Lawrence and Riezler 2018), bigrams for summarisation (P.V.S. and Meyer 2017), and preferences for translation (Kreutzer et al. 2018b). Controlled experiments suggest that asking for preferences places a lower cognitive burden on the human subjects than asking for absolute ratings or categorised labels (Thurstone 1927; Kendall 1948; Kingsley and Brown 2010). But it remains unclear whether people can easily provide reliable preferences over summaries. In addition, preference-based interactive NLP faces the high sample complexity problem: a preference is a binary decision and hence only contains a single bit of information, so the NLP systems usually need to elicit a large number of preferences from the users to improve their performance. For example, the machine translation system by Sokolov et al. (2016a) needs to collect hundreds of thousands of preferences from a simulated user before it converges.
Collecting such large amounts of user input and using them to train a "one-fits-all" model might be feasible for tasks such as machine translation, because the learnt model can generalise to many unseen texts. However, for highly subjective tasks, such as document summarisation, this procedure is not effective, since the notion of importance is specific to a certain topic or user. For example, the information that Lee Harvey Oswald shot President Kennedy might be important when summarising the assassination, but less important for a summary of Kennedy's childhood. Likewise, a user who is not familiar with the assassination might consider the information more important than a user who has been analysing its political background for many years. Therefore, we aim at an interactive system that adapts a model for a given topic and user context based on user feedback, instead of training a single model across all users and topics, which hardly fits anyone's needs perfectly. In this scenario, it is essential to overcome the high sample complexity problem and learn to adapt the model using a minimum of user interaction.
In this article, we propose the Active Preference-based ReInforcement Learning (APRIL) framework. Our core research idea is to split the preference-based interactive learning process into two stages. First, we estimate the user's ranking over candidate summaries using active preference learning (APL) in an interaction loop. Second, we use the learnt ranking to guide a neural reinforcement learning (RL) agent to search for the (near-)optimal summary. The use of APL allows us to maximise the information gain from a small number of preferences, helping to reduce the sample complexity. Fig. 1 shows this general idea in comparison to the state-of-the-art preference-based interactive NLP paradigm, Structured Prediction from Partial Information (SPPI) (Sokolov et al. 2016b; Kreutzer et al. 2017). In §3, we discuss the technical background of RL, preference learning and SPPI, before we introduce our solution APRIL in §4.
We apply APRIL to the Extractive Multi-Document Summarisation (EMDS) task. Given a cluster of documents on the same topic, an EMDS system needs to extract important sentences from the input documents to generate a summary complying with a given length requirement that fits the needs of the user and her/his task. For the first time, we provide evidence for the efficacy of preference-based interaction in EMDS based on a user study, in which we measure the usability and the noise of preference feedback, yielding a mathematical model we can use for simulation and for analysing our results (§5).
To evaluate APRIL, we then perform experiments on standard EMDS benchmark datasets. We compare the effectiveness of multiple APL and RL algorithms and select the best algorithms for our full system. We compare APRIL to SPPI and non-interactive methods, in both simulation (§6) and real-user experiments (§7). Our results suggest that with only ten rounds of user interaction, APRIL produces summaries better than those produced by both non-interactive methods and SPPI.
This work extends our earlier work (Gao et al. 2018) in three aspects. (i) We present a new user study on the reliability and usability of the preference-based interaction (§5). Based on this study, we propose a realistic simulated user, which is used in our experiments. (ii) We evaluate multiple new APL strategies and a novel neural RL algorithm, and compare them with the counterpart methods used in Gao et al. (2018). The use of these new algorithms further boosts the efficiency and performance of APRIL (§6). (iii) We conduct additional user studies to compare APRIL with both non-interactive baselines and SPPI under more realistic settings (§7). APRIL can be applied to a wide range of other NLP tasks, including machine translation, semantic parsing and information exploration. All source code and experimental setups can be found at https://github.com/UKPLab/irj-neural-april.

Related Work
SPPI. The method most similar to ours is SPPI (Sokolov et al. 2016b; Kreutzer et al. 2017). The core of SPPI is a policy-gradient RL algorithm, which receives rewards derived from the preference-based feedback. It maintains a policy that approximates the utility of each candidate output and selects the higher-utility candidates with higher probability. As discussed in §1, SPPI suffers heavily from the high sample complexity problem. We will present the technical details of SPPI in §3.3 and compare it to APRIL in §6 and §7.
Preferences. The use of preference-based feedback in NLP attracts increasing research interest. Zopf (2018) learns a sentence ranker from human preferences on sentence pairs, which can be used to evaluate the quality of summaries by counting how many high-ranked sentences are included in a summary. Simpson and Gurevych (2018) develop an improved Gaussian process preference learning (Chu and Ghahramani 2005) algorithm to learn an argument convincingness ranker from noisy preferences. Unlike these methods that focus on learning a ranker from preferences, we focus on using preferences to generate better summaries. Kreutzer et al. (2018b) ask real users to provide cardinal (5-point ratings) and ordinal (pairwise preferences) feedback over translations, and use the collected data to train an off-policy RL to improve the translation quality. Their study suggests that the inter-rater agreement for the cardinal and ordinal feedback is similar. However, they do not measure or consider the influence of the questions' difficulties on the agreement, which we find significant for EMDS (see §5). In addition, their system is not interactive, but uses log data instead of actively querying users.
Interactive Summarisation. The iNeATS (Leuski et al. 2003) and IDS (Jone et al. 2002) systems allow users to tune several parameters (e.g., size, redundancy, focus) to customise the produced summaries. Further work presents automatically derived summary templates (Orǎsan et al. 2003; Orǎsan and Hasler 2006) or hierarchically ordered summaries (Christensen et al. 2014; Shapira et al. 2017) allowing users to drill down from a general overview to detailed information. However, these systems do not employ the users' feedback to update their internal summarisation models. P.V.S. and Meyer (2017) propose an interactive EMDS system that asks users to label important bigrams within candidate summaries. Given the important bigrams, they use integer linear programming to optimise important bigram coverage in the summary. In simulation experiments, their system can achieve near-optimal performance in ten rounds of interaction, collecting up to 350 important bigrams. However, labelling important bigrams is a large burden on the users, as users have to read through many potentially unimportant bigrams (see §5). Also, they assume that the users' feedback is always perfect.
Reinforcement Learning. RL has been applied to both extractive and abstractive summarisation in recent years (Ryang and Abekawa 2012; Rioux et al. 2014; Gkatzia et al. 2014; Henß et al. 2015; Paulus et al. 2017; Pasunuru and Bansal 2018; Kryscinski et al. 2018). Most existing RL-based document summarisation systems either use heuristic functions (e.g., Ryang and Abekawa 2012; Rioux et al. 2014), which do not rely on reference summaries, or ROUGE scores requiring reference summaries as the rewards for RL (Paulus et al. 2017; Pasunuru and Bansal 2018; Kryscinski et al. 2018). However, neither ROUGE nor the heuristics-based rewards can precisely reflect real users' requirements on summaries (Chaganty et al. 2018); hence, using these imprecise rewards can severely mislead the RL-based summariser. The quality of the rewards has been recognised as the bottleneck for RL-based summarisation systems (Kryscinski et al. 2018). Our work learns how to give good rewards from users' preferences. In this work, we assume that our system has no access to the reference summaries, but can query a user for preferences over summary pairs. Some RL work directly uses the users' ratings as rewards. Nguyen et al. (2017) employ user ratings on translations as rewards when training an RL-based encoder-decoder translator. However, eliciting ratings on summaries is very expensive as users have high variance in their ratings of the same summary (Chaganty et al. 2018), which is why we consider preference-based feedback and a learnt reward surrogate.
Preference-based RL (PbRL) is a recently proposed paradigm at the intersection of preference learning, RL, active learning (AL) and inverse RL (Wirth et al. 2017). Unlike apprenticeship learning (Dethlefs and Cuayáhuitl 2011), which requires the user to demonstrate (near-)optimal sequences of actions (called action trajectories), PbRL only asks for the user's preferences (either partial or total order) on several action trajectories. Wirth et al. (2016) apply PbRL to several simulated robotics tasks. They show that their method can achieve near-optimal performance by interacting with a simulated perfect user for 15-40 rounds. Christiano et al. (2017) apply PbRL to robotics tasks, Atari-playing agents and a simulated back-flipping agent by collecting feedback from both simulated oracles and real crowdsourcing workers. They find that human feedback can be noisy and partial (i.e., capturing only a fraction of the true reward), but that it is much easier for people to provide consistent comparisons than consistent absolute scores in their robotics use case. In §5, we evaluate this for document summarisation.

Table 1: Overview of the notation used in this article

Notation                 Description
x ∈ X                    a document cluster x from the set of all possible inputs X
y ∈ Y_x ⊆ S              a summary y from the set of all legal summaries Y_x for x ∈ X
M_x = (S, A, P, R, T)    MDP of the EMDS task for x ∈ X: states S, actions A, transition function P, reward function R and terminal states T ⊆ S
R(y)                     the reward of summary y in M_x
π(y)                     policy in RL: the probability of selecting summary y in M_x
π(⟨y_i, y_j⟩; w)         policy in SPPI, parameterised by w: the probability of presenting pair ⟨y_i, y_j⟩ to the oracle (Eq. (6))
…                        … (Eq. (3))
R^SPPI_x                 the objective function in SPPI (Eq. (5))
However, the approach by Christiano et al. (2017) fails to obtain satisfactory results in some robotics tasks even after 5,000 interaction rounds. In a follow-up work, Ibarz et al. (2018) elicit demonstrations from experts, use the demonstrations to pre-train a model with imitation learning techniques, and successfully fine-tune the pre-trained model with PbRL. In EMDS, extractive reference summaries might be viewed as demonstrations, but they are expensive to collect and not available in popular summarisation corpora (e.g., the DUC datasets). APRIL does not require demonstrations, but learns a reward function based on user preferences on entire summaries, which is then used to train an RL policy.

Background
In this section, we recap necessary details of RL (§3.1), preference learning (§3.2) and SPPI (§3.3). We adapt them to the EMDS use case, so as to lay the foundation for APRIL. To ease the reading, we summarise the notation used in the remainder of the article in Table 1.

Reinforcement Learning
RL comprises algorithms for efficiently searching for optimal solutions in Markov Decision Processes (MDPs). MDPs are widely used to formulate sequential decision-making problems. Let X be the input space and let Y_x be the set of all possible outputs for input x ∈ X. An episodic MDP is a tuple M_x = (S, A, P, R, T) for input x ∈ X, where S is the set of states, A is the set of actions and P : S × A → S is the transition function, with P(s, a) giving the next state after performing action a in state s. R : S × A → ℝ is the reward function, with R(s, a) giving the immediate reward for performing action a in state s. T ⊆ S is the set of terminal states; visiting a terminal state terminates the current episode.
EMDS can be formulated as an episodic MDP, as the summariser has to sequentially select sentences from the original documents to add to the draft summary. Our MDP formulation of EMDS matches previous approaches by Ryang and Abekawa (2012) and Rioux et al. (2014): x ∈ X is a cluster of documents and Y_x is the set of all legal summaries for cluster x (i.e., all permutations of sentences in x that fulfil the given summary length constraint). In the MDP M_x for document cluster x ∈ X, S includes all possible draft summaries of any length (i.e., Y_x ⊆ S). The action set A includes two types of actions: concatenate a sentence in x to the current draft summary, or terminate the draft summary construction. The transition function P is trivial in EMDS, because given the current draft summary and an action, the next state can be easily identified as the draft summary plus the selected sentence or as a terminating state. The reward function R returns an evaluation score of the summary once the action terminate is performed; otherwise it returns 0 because the summary is still under construction and thus not ready to be evaluated (so-called delayed rewards). Providing non-zero rewards before the action terminate can lead to even worse results, as reported by Rioux et al. (2014). The terminal states set T includes all states corresponding to summaries exceeding the given length requirement and an absorbing state s_T. By performing action terminate, the agent transitions to s_T regardless of its current state, i.e. P(s, a) = s_T for all s ∈ S if a is terminate.
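The MDP formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this article: the class name `EmdsMdp`, the encoding of states as tuples of sentence indices, the constants `TERMINATE` and `S_T`, and the word-count length limit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, ...]   # indices of the sentences selected so far (assumed encoding)
TERMINATE = -1            # hypothetical id of the 'terminate' action
S_T: State = (-1,)        # hypothetical encoding of the absorbing state s_T


@dataclass
class EmdsMdp:
    sentences: List[str]                 # sentences of document cluster x
    max_words: int                       # summary length limit (assumed: in words)
    score: Callable[[List[str]], float]  # evaluation score used as the final reward

    def length(self, s: State) -> int:
        return sum(len(self.sentences[i].split()) for i in s)

    def actions(self, s: State) -> List[int]:
        """Sentences not yet in the draft, plus the terminate action."""
        return [i for i in range(len(self.sentences)) if i not in s] + [TERMINATE]

    def transition(self, s: State, a: int) -> State:
        """P(s, a): deterministic; terminate always leads to s_T."""
        return S_T if a == TERMINATE else s + (a,)

    def is_terminal(self, s: State) -> bool:
        """T: the absorbing state and all drafts exceeding the length limit."""
        return s == S_T or self.length(s) > self.max_words

    def reward(self, s: State, a: int) -> float:
        """R(s, a): delayed reward, non-zero only when terminating."""
        if a != TERMINATE:
            return 0.0
        return self.score([self.sentences[i] for i in s])
```

Keeping the transition deterministic and the reward zero until terminate mirrors the delayed-reward design described above; any scoring function can be plugged in through `score`.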
A policy π : S × A → ℝ in an MDP M_x defines how actions are selected: π(s, a) is the probability of selecting action a in state s. Note that in many sequential decision-making tasks, π is learnt across all inputs x ∈ X. However, for our EMDS use case, we learn an input-specific policy for a given x ∈ X in order to reflect the subjectivity of the summarisation task introduced in §1. We let Y_π be the set of all possible summaries a policy π can construct in document cluster x. π(y) denotes the probability of policy π generating a summary y in x. Likewise, R(y) denotes the accumulated reward received by building summary y. Finally, the expected reward of performing π is:
$$R^{RL}(\pi) = \sum_{y \in Y_\pi} \pi(y)\, R(y) \qquad (1)$$
The goal in an MDP is to find the optimal policy π* that has the highest expected reward: π* = arg max_π R^RL(π).

Preference Learning
For a document cluster x ∈ X and its legal summaries set Y_x, we let U*_x : Y_x → ℝ be the ground-truth utility function measuring the quality of summaries in Y_x. We additionally assume that no two items in Y_x have the same U*_x value. Let σ*_x be the ascending ranking induced by U*_x:
$$\sigma^*_x(y_i) = \sum_{y_j \in Y_x} \mathbb{1}\left[U^*_x(y_i) > U^*_x(y_j)\right] \qquad (2)$$
where 𝟙 is the indicator function. In other words, σ*_x(y) gives the rank of y among all elements in Y_x with respect to U*_x. The goal of preference learning is to approximate σ*_x from the pairwise preferences on some elements in Y_x. The preferences are provided by an oracle.
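As a small illustration of the ascending ranking induced by a utility function, the following sketch counts, for every item, how many other items have strictly lower utility. The function name `ascending_rank` and the list-based input are assumptions made for the example.

```python
def ascending_rank(utilities):
    """Rank of each item: the number of items with strictly lower utility."""
    return [sum(u_i > u_j for u_j in utilities) for u_i in utilities]
```

With distinct utilities, the best item receives rank len(utilities) - 1 and the worst receives rank 0.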
The Bradley-Terry (BT) model (Bradley and Terry 1952) is a widely used preference learning model, which approximates the ranking σ*_x by approximating the utility function U*_x. Suppose we have observed N preferences p_x(⟨y_{i,1}, y_{i,2}⟩), i = 1, ..., N, where y_{i,1}, y_{i,2} ∈ Y_x are the summaries presented to the oracle in the i-th round, and p_x indicates the preference direction of the oracle: p_x = 1 if the oracle prefers y_{i,1} over y_{i,2}, and p_x = 0 otherwise. The objective in BT is to maximise the likelihood of the observed preferences (Eq. (3)), where Û_x is the approximation of U*_x parameterised by w, which can be learnt by any function approximation technique, e.g. neural networks or linear models. Maximising Eq. (3) yields w and thus Û_x, which in turn can be used to induce the approximated ranking function σ̂_x : Y_x → ℝ.
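A minimal sketch of fitting a BT model may help make the objective concrete. It assumes the common logistic form of the BT preference probability, a linear utility w · φ, and plain gradient ascent on the log-likelihood; the function names, the learning rate and the number of steps are illustrative choices and not taken from this article.

```python
import math


def dot(w, phi):
    return sum(wi * pi for wi, pi in zip(w, phi))


def bt_prob(u_first, u_second):
    """Modelled probability that the first item is preferred (assumed logistic link)."""
    return 1.0 / (1.0 + math.exp(u_second - u_first))


def bt_log_likelihood(w, prefs):
    """prefs: list of (phi_1, phi_2, p) with p = 1 if item 1 is preferred, else 0."""
    total = 0.0
    for phi1, phi2, p in prefs:
        q = bt_prob(dot(w, phi1), dot(w, phi2))
        total += p * math.log(q) + (1 - p) * math.log(1.0 - q)
    return total


def bt_fit(prefs, dim, lr=0.1, steps=500):
    """Gradient ascent on the log-likelihood of a linear utility w . phi."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi1, phi2, p in prefs:
            q = bt_prob(dot(w, phi1), dot(w, phi2))
            for k in range(dim):
                # d/dw of the log-likelihood term: (p - q) * (phi1 - phi2)
                grad[k] += (p - q) * (phi1[k] - phi2[k])
        w = [wk + lr * gk for wk, gk in zip(w, grad)]
    return w
```

The learnt weights define a utility estimate for any item with a feature vector, and sorting items by that estimate gives the approximated ranking.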
In the SPPI objective (Eq. (5)), p_x is the same preference direction function as in preference learning (§3.2), and π is a policy that decides the probability of presenting a pair of summaries to the oracle (Eq. (6)). In line with preference learning, Û_x is the utility function for estimating the quality of summaries, parameterised by w. The policy π samples the pairs with larger utility gaps with higher probability; as such, both "good" and "bad" summaries have the chance to be presented to the oracle, which encourages the exploration of the summary space. To maximise Eq. (5), SPPI uses gradient ascent to update w incrementally. Algorithm 1 presents the pseudo code of our adaptation of SPPI to EMDS. Note that the objective function in SPPI (Eq. (5)) and the expected reward function in RL (Eq. (1)) have a similar form: if we view the preference direction function p_x in Eq. (5) as a reward function, we can consider SPPI as an RL problem. The major difference between SPPI and RL is that the policy in SPPI selects pairs (Eq. (6)), while the policy in RL selects single summaries (see §3.1). For APRIL, we will exploit this connection to propose our new objective function and learning paradigm.
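For orientation, one form of the SPPI objective and pair-selection policy that is consistent with the description above (an expected preference direction under a Gibbs distribution over utility gaps) is sketched below. Both expressions are assumptions reconstructed from the surrounding text, not a verbatim copy of Eq. (5) and Eq. (6).

```latex
% Assumed form of the SPPI objective: expected preference direction under the pair policy
R^{\mathrm{SPPI}}_x(w) = \sum_{y_i, y_j \in Y_x} \pi(\langle y_i, y_j \rangle; w)\, p_x(\langle y_i, y_j \rangle)

% Assumed form of the pair policy: Gibbs distribution over estimated utility gaps
\pi(\langle y_i, y_j \rangle; w) =
  \frac{\exp\left[\hat{U}_x(y_i; w) - \hat{U}_x(y_j; w)\right]}
       {\sum_{y_p, y_q \in Y_x} \exp\left[\hat{U}_x(y_p; w) - \hat{U}_x(y_q; w)\right]}
```

Written this way, the first expression has the same shape as the expected reward in RL, with the pair policy in place of π(y) and p_x in place of R(y).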

The APRIL Framework
SPPI suffers from the high sample complexity problem, which we attribute to two major reasons. First, the policy π in SPPI (Eq. (6)) is good at distinguishing the "good" summaries from the "bad" ones, but poor at selecting the "best" summaries from the "good" ones, because it only queries the summaries with large quality gaps. Second, SPPI makes inefficient use of the collected preferences: after each round of interaction, SPPI performs one step of the policy gradient update, but does not generalise or re-use the collected preferences. This potentially wastes expensive user information. To alleviate these two problems, we exploit the connection between SPPI, RL and preference learning and propose the APRIL framework detailed in this section. Recall that in EMDS, the goal is to find the optimal summary for a given document cluster x, namely the summary that is preferred over all other possible summaries in Y_x according to σ*_x. Based on this understanding and in line with the RL formulation of EMDS from §3.1, we define a new expected reward function R^APRIL_x for policy π (Eq. (7)). Under the definition in §3.2, p_x(⟨y, y_i⟩) equals 1 if y is preferred over y_i and equals 0 otherwise. Thus, Σ_{y_i ∈ Y_x} p_x(⟨y, y_i⟩) counts the number of summaries that are less preferred than summary y, and hence equals σ*_x(y) (see Eq. (2)). A policy that maximises this new objective function selects the summaries with the highest rankings and hence outputs the optimal summary. This new objective function decomposes the learning problem into two stages: (i) approximating the ranking function σ*_x, and (ii) based on the approximated ranking function, searching for the optimal policy that maximises the new objective function. These two stages can be solved by (active) preference learning and reinforcement learning, respectively, and they constitute our APRIL framework, illustrated in Figure 2.
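One way to write the new expected reward that agrees with the description above is sketched below. It is an assumed reconstruction, not a verbatim copy of Eq. (7): it uses the §3.2 convention that the preference direction function returns 1 when its first argument is preferred, and it omits any normalisation constant the original definition may contain.

```latex
% Assumed form; convention of Sec. 3.2: p_x(<y, y_i>) = 1 iff y is preferred over y_i
R^{\mathrm{APRIL}}_x(\pi)
  = \sum_{y \in Y_\pi} \pi(y) \sum_{y_i \in Y_x} p_x(\langle y, y_i \rangle)
  = \sum_{y \in Y_\pi} \pi(y)\, \sigma^*_x(y)
```

The second equality holds because, with distinct utilities, the inner sum counts exactly the summaries ranked below y, which is the ascending rank of y.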

Stage 1: Active Preference Learning
For an input document cluster x ∈ X, the task in the first stage of APRIL is to obtain σ̂_x, the approximated ranking function on Y_x, by collecting a small number of preferences from the oracle. It involves four major components: a Summary Database (DB) storing the summary candidates, an AL-based Querier that selects candidates from the Summary DB to present to the user, a Preference DB storing the preferences collected from the user, and a Preference Learner that learns σ̂_x from the preferences. The left cycle in Fig. 2 illustrates this stage, and Alg. 2 presents the corresponding pseudo code. Below, we detail these four components.

Algorithm 2: Active preference learning (Stage 1 in APRIL).
Input: query budget N; document cluster x; Summary DB D_S(x); heuristic h; trade-off β; learning rate α
    let Û_x = h
    get first summary y_{1,1} by Eq. (9)
    initialise w_0
    while i = 1, ..., N do
        select y_{i,2} according to Eq. (9)
        get preference p^i_x(y_{i,1}, y_{i,2}) from the oracle, add to D_P
        … (Eq. (8))
Output: Û_x and its induced ranking σ̂_x
Summary DB D_S(x). Ideally, D_S(x) should include all legal extractive summaries for a document cluster x, namely D_S(x) = Y_x. Since this is impractical for large clusters, we either randomly sample a large set of summary candidates or use pre-trained summarisation models and heuristics to generate D_S(x). Note that D_S(x) can be built offline, i.e. before the interaction with the user starts. This improves the real-time responsiveness of the system.
Preference DB D_P. The preference database stores all collected user preferences, D_P = {p_x(⟨y_{i,1}, y_{i,2}⟩)}_{i=1}^{N}, where N is the query budget (i.e., how many times a user may be asked for a preference), y_{i,1}, y_{i,2} ∈ D_S(x) are the summaries presented to the user in the i-th round of interaction, and p_x is the user's preference (see §3.2).
Preference Learner. We use the BT model introduced in §3.2 to learn σ̂_x from the preferences in D_P. In order to increase the real-time responsiveness of our system, we use a linear model to approximate the utility function U*_x, i.e. Û_x(y) = w · φ(y, x), where φ(y, x) is a vectorised representation of summary y for input cluster x. However, purely using w · φ(y, x) to approximate U*_x is sensitive to the noise in the preferences, especially when the number of collected preferences is small. To mitigate this, we approximate U*_x not only using w · φ(y, x) (the posterior), but also using some prior knowledge h(y, x) (the prior), for example the heuristics-based summary evaluation function proposed by Ryang and Abekawa (2012) and Rioux et al. (2014). Note that these heuristics do not require reference summaries; see §2. Formally, we define Û_x as a weighted combination of the prior and the posterior (Eq. (8)), where β ∈ [0, 1] is a real-valued parameter trading off between the prior and posterior.
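The trade-off between prior and posterior can be illustrated with a short sketch. It assumes a convex combination in which β weights the learnt term and 1 − β weights the heuristic prior; the text only states that β ∈ [0, 1] trades off the two, so the direction of the weighting and the function name `utility_estimate` are assumptions.

```python
def utility_estimate(phi, w, h_value, beta):
    """Blend a heuristic prior h(y, x) with the learnt linear term w . phi(y, x).

    ASSUMPTION: beta weights the learnt term and (1 - beta) the prior.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    learnt = sum(wi * pi for wi, pi in zip(w, phi))
    return (1.0 - beta) * h_value + beta * learnt
```

Under this assumption, β = 0 falls back to the heuristic alone, which matches the initialisation of the utility estimate with h before any preference has been collected.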
AL-based Querier. The active-learning-based querier receives Û_x and selects which candidate pair from D_S to present to the user in each round of interaction. To reduce the reading burden of the oracle, and inspired by the preference collection workflows in robot training (Wirth et al. 2016), we use the following setup to obtain summary pairs: in each interaction round, one summary of the pair is old (i.e. it was presented to the user in the previous round) and the other one is new (i.e. it has not been read by the user before). As such, the user only needs to read N + 1 summaries in N rounds of interaction. Any pool-based active learning strategy (Settles 2010) can be used to implement the querier, e.g., uncertainty sampling (Lewis and Gale 1994). We explore four computationally efficient active learning strategies:
- Utility gap (ΔÛ_x): Inspired by the policy of SPPI (see §3.3 and Eq. (6)), this strategy presents summaries with large estimated utility gaps ΔÛ_x, i.e. large differences between the estimated utilities of the two summaries.
- Diversity-based heuristic (div): This strategy minimises the vector space similarity of the presented summaries. For a pair y_i, y_j ∈ Y_x, div(y_i, y_j|x) is defined via the cosine similarity cos of their vector representations, so that less similar pairs receive higher values. This heuristic encourages querying dissimilar summaries, so as to encourage exploration and facilitate generalisation. In addition, dissimilar summaries are more likely to have large utility gaps and hence can be answered more accurately by the users (discussed later in §5).
- Density-based heuristic (den): This strategy encourages querying summaries from "dense" areas in the vector space, so as to avoid querying outliers and to facilitate generalisation. Formally, for a summary y for cluster x, we define den(y|x) = 1 − min_{y' ∈ D_S(x), y' ≠ y} div(y, y'|x).
- Uncertainty-based heuristic (unc): This strategy encourages querying the summaries whose approximated utility Û_x is most uncertain. In line with P.V.S. and Meyer (2017), we define unc as follows: for a summary y ∈ D_S(x), we estimate the probability of y being the optimal summary as pb(y|x), and let the uncertainty of y be unc(y|x) = 1 − pb(y|x) if pb(y|x) ≥ 0.5, and unc(y|x) = pb(y|x) otherwise.
To exploit the strengths of all these AL strategies, we normalise their output values to the same range and use their weighted sum to select the new summary y* to present to the user:
$$y^* = \arg\max_{y \in D_S(x)} \left[ w_g\, \Delta\hat{U}_x(y, y') + w_d\, \mathrm{div}(y, y'|x) + w_e\, \mathrm{den}(y|x) + w_u\, \mathrm{unc}(y|x) \right] \qquad (9)$$
where y' is the old summary, i.e. the one from the previous interaction round.
To select the first summary, we let div(y, y') = 0 and ΔÛ_x(y, y') = Û_x(y). w_g, w_d, w_e and w_u denote the weights for the four heuristics.
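The querier described above can be sketched as follows. Several details are assumptions made for the example and are not stated in the text: the utility gap is taken as an absolute difference, div as one minus cosine similarity, pb as a logistic function of the estimated utility, and the normalisation as min-max scaling over the candidate pool. The function names and the `seen` parameter are hypothetical.

```python
import math


def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def div(u, v):
    return 1.0 - cosine(u, v)                 # ASSUMED: one minus cosine similarity


def den(i, vecs):
    """Density of item i: 1 minus its distance to the nearest other item."""
    return 1.0 - min(div(vecs[i], vecs[j]) for j in range(len(vecs)) if j != i)


def unc(u_hat):
    pb = 1.0 / (1.0 + math.exp(-u_hat))       # ASSUMED logistic form of pb
    return 1.0 - pb if pb >= 0.5 else pb


def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def select_next(vecs, u_hat, old, weights=(1.0, 1.0, 1.0, 1.0), seen=()):
    """Index of the new summary to pair with the old one (old=None for the first query)."""
    cand = [i for i in range(len(vecs)) if i != old and i not in seen]
    gap = [u_hat[i] if old is None else abs(u_hat[i] - u_hat[old]) for i in cand]
    dv = [0.0 if old is None else div(vecs[i], vecs[old]) for i in cand]
    dn = [den(i, vecs) for i in cand]
    un = [unc(u_hat[i]) for i in cand]
    w_g, w_d, w_e, w_u = weights
    cols = [minmax(gap), minmax(dv), minmax(dn), minmax(un)]
    scores = [w_g * g + w_d * d + w_e * e + w_u * u for g, d, e, u in zip(*cols)]
    return cand[max(range(len(cand)), key=scores.__getitem__)]
```

Because only one new summary is selected per round and paired with the previous one, a loop over N rounds shows the user N + 1 summaries in total.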

Stage 2: RL-based Summariser
Given the approximated ranking σ̂_x learnt by the first stage, the target of the second stage in APRIL is to obtain the policy π* that maximises the expected reward under σ̂_x. We consider two RL algorithms to obtain π*: the linear Temporal Difference (TD) algorithm, and a neural version of the TD algorithm. TD (Sutton 1984) has proven effective for solving the MDP in EMDS (Rioux et al. 2014; Ryang and Abekawa 2012). The core of TD is to approximate the V-values V^π(s). In EMDS, V^π(s) estimates the "potential" of the (draft) summary s for input cluster x given policy π: the higher the V^π(s) value, the more likely s is contained in the optimal summary for x. Given the V-values, a policy can be derived using the softmax strategy:
$$\pi(s, a) = \frac{\exp\left[V^\pi(P(s, a))\right]}{\sum_{a'} \exp\left[V^\pi(P(s, a'))\right]} \qquad (10)$$
where a' ranges over all available actions in the state s. The intuition behind Eq. (10) is that the probability of performing the action a increases if the resulting state of a, namely P(s, a), has a higher V-value. Note the similarity between the policy of TD (Eq. (10)) and the policy of SPPI (Eq. (6)): they both use a Gibbs distribution to assign probabilities to different actions, but the difference is that an action in SPPI is a pair of summaries, while in TD an action is adding a sentence to the current draft summary or terminate (see §3.1).
Existing works use linear functions to approximate the V-values (Rioux et al. 2014; Ryang and Abekawa 2012). To approximate the V-values more precisely, we use a neural network and term the resulting algorithm Neural TD (NTD). Inspired by DQN (Mnih et al. 2015), we employ the memory replay and periodic update techniques to boost and stabilise the performance of NTD. We use NTD rather than DQN (Mnih et al. 2015) because in MDPs with discrete actions and continuous states, as in our EMDS formulation, Q-Learning needs to maintain a Q(s, a) network for each action a, which is very expensive when the number of actions is large. Instead, the TD algorithms only have to maintain the V(s) network, whose size is independent of the number of actions. In EMDS, the size of the action set typically exceeds several hundred (see Table 2), because each sentence corresponds to one action. Alg. 3 shows the pseudo code of NTD. We use the Summary DB D_S as the memory replay. This helps us to reduce the sample generation time, which is critical in interactive systems. We select samples from D_S using softmax sampling (line 3 in Alg. 3), parameterised by θ, the parameters of the neural network. Given the selected summary y, we build a sequence of k states (line 4), where k is the number of sentences in y, and state s_i is the draft summary including the first i sentences of y. Then, we update the error L_TD as in the standard TD algorithms (lines 5 to 9) and perform back propagation with gradient descent (line 10). We update θ every C episodes (line 11), as in DQN, to stabilise the performance of NTD.
After training has finished, the obtained V-values can be used to derive π* by Eq. (10).
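To make the relation between V-values and the derived policy concrete, the sketch below turns the V-values of the successor states into action probabilities with a softmax, and shows a tabular TD(0) update. The tabular dictionary is a deliberate simplification standing in for the linear or neural approximators discussed above; the function names, the step size `alpha` and the discount `gamma` are assumptions made for the example.

```python
import math


def softmax_policy(next_values):
    """next_values: {action: V(P(s, a))}. Returns {action: probability}."""
    m = max(next_values.values())             # subtract the max for numerical stability
    exps = {a: math.exp(v - m) for a, v in next_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def td0_update(V, s, s_next, reward, alpha=0.1, gamma=1.0):
    """Tabular TD(0): move V(s) towards reward + gamma * V(s_next)."""
    target = reward + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V[s]
```

Actions whose successor states have higher V-values receive higher probability, which is the behaviour the softmax strategy is meant to produce.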

Preference-based Interaction for Summarisation
To date, there is little knowledge about the usability and the reliability of user feedback in summarisation. This is a major limitation for designing interactive systems and for effectively experimenting with simulated users before an actual user study. In this section, we therefore study preference-based feedback for our EMDS use case and derive a mathematical model to simulate real users' preference-giving behaviour.
Hypotheses. Our study tests two hypotheses: (H1) We assume that users find it easier to provide preference feedback than to provide other forms of feedback for summaries. In particular, we measure the user satisfaction and the time needed for preference-based interaction and for the bigram-based interaction proposed by P.V.S. and Meyer (2017), which has also been used in interactive summarisation.
(H2) Previous research suggests that the more difficult the questions, the lower the rate of correct answers or, in other words, the higher the noise in the answers (Huang et al. 2016; Donmez and Carbonell 2008). In our preference-giving scenario, we assume that the difficulty of comparing a pair of items can be measured by the utility gap between the presented items: the wider the utility gap, the easier it is for the user to identify the better item. We term this the wider-gap-less-noise hypothesis in this article.
The wider-gap-less-noise hypothesis is an essential motivation for the policy in SPPI (Eq. (6)) and the diversity-based active learning strategy in APRIL (see §4.1), yet there is little empirical evidence for validating this hypothesis. Based on the findings in our user study, we provide evidence towards H1 and H2, and we propose a realistic user model, which we employ in our simulation experiments in §6.
Study setup. We invite 12 users to participate in our user study. All users are native or fluent English speakers from our university. We ask each user to provide feedback for newswire summaries from two topics (d074b from DUC'02 and d32f from DUC'01) in the following way.
We first allow the users to familiarise themselves with the topic by means of two 200-word abstracts. This is necessary, since the events discussed in the news documents are several years old and may be unknown to our participants. Without such background information, it would not be possible for users to judge importance in the early stages of the study. We ask each user to provide preferences for ten summary pairs and to label all important bigrams in five additional summaries. For collecting preference-based feedback, we ask the participants to select the better summary (i.e. the one containing more important information) in each pair. For collecting bigram-based feedback, we adopt the setup of P.V.S. and Meyer (2017), who proposed a successful EMDS system using bigram-based interaction. At the end of the study, we ask the participants to rate the usability (i.e., user-friendliness) of preference- and bigram-based interaction on a 5-point Likert scale, where higher scores indicate higher usability.
To evaluate H2, we require summary pairs with different utility gaps. To this end, we measure the utility U*_x (see §3.2) of a summary y for document cluster x as

$$U^*_x(y) = \frac{10}{3}\left(\frac{R_1(y, r_x)}{0.47} + \frac{R_2(y, r_x)}{0.22} + \frac{R_{SU}(y, r_x)}{0.18}\right), \qquad (12)$$

where r_x are the reference summaries for document cluster x (provided in the DUC datasets), and R_1, R_2 and R_SU stand for average ROUGE-1, ROUGE-2 and ROUGE-SU4 recall metrics (Lin 2004), respectively. These ROUGE metrics are widely used to measure the quality of summaries. The denominator values 0.47, 0.22 and 0.18 are the upper-bound ROUGE scores reported by P.V.S. and Meyer (2017). They are used to balance the weights of the three ROUGE scores. As such, each ROUGE score is normalised to [0, 1], and we further multiply the sum of the ROUGE scores by 10/3 to normalise U*_x values to [0, 10], which facilitates our analyses afterwards.

Summ A,1 (U* = 3.99): "I think he's doing a beautiful job up there."; President Bush, asked at a news conference whether Thomas' claim not to have an opinion on abortion is credible, answered, "That's a question for the Senate to decide. In their respective careers, the Thomases have embraced the view that women and minorities are hindered, rather than helped, by affirmative action and government programs. True equality is achieved by holding everyone to the same standard, they believe."; Before Thomas' testimony ended, the unflappable 43-year-old federal judge was criticized, sometimes in harsh terms, by several liberal Democrats. Hatch asked.

Fig. 3: Two summary pairs from topic d074b with utility gaps ∆ = 1 (pair A, the upper two summaries) and ∆ = 5 (pair B, the bottom two summaries).
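The utility computation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the three ROUGE recall scores have already been computed by some external ROUGE tool; the function names `utility` and `utility_gap` are hypothetical and not taken from the paper.

```python
# Upper-bound ROUGE scores reported by P.V.S. and Meyer (2017), used as normalisers.
R1_MAX, R2_MAX, RSU_MAX = 0.47, 0.22, 0.18


def utility(r1: float, r2: float, rsu4: float) -> float:
    """Combine ROUGE-1, ROUGE-2 and ROUGE-SU4 recall into one utility score.

    Each score is divided by its upper bound, and the sum of the three
    normalised scores is scaled by 10/3 so that the result lies in about [0, 10].
    """
    normalised_sum = r1 / R1_MAX + r2 / R2_MAX + rsu4 / RSU_MAX
    return 10.0 / 3.0 * normalised_sum


def utility_gap(u_first: float, u_second: float) -> float:
    """Signed utility gap between two summaries of the same cluster."""
    return u_first - u_second
```

A summary that reaches all three upper bounds obtains a utility of 10, and a summary with no overlap with the references obtains 0.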
For document cluster x, the utility gap ∆U*_x of two summaries y_1 and y_2 is thus ∆U*_x(y_1, y_2) = U*_x(y_1) − U*_x(y_2). As for the ten summary pairs in our user study, we select four pairs with utility gap ∆ = 1, three with ∆ = 3, two with ∆ = 5 and one with ∆ = 7, where ∆ = |∆U*_x| ± .1 (i.e., a utility gap very close to the predefined gap width). Figure 3 shows two example summary pairs and their U*_x. As for the five summaries for bigram-based feedback, we select summaries with high utility U*_x, but ensure that they have low overlap in order to simulate the AL setup of P.V.S. and Meyer (2017).
Usability assessment. To evaluate hypothesis H1, we measure the ease of providing preferences for summaries with two metrics: the average interaction time a participant spends in providing a preference and the participant's usability rating on the 5-point scale. We compare both metrics for preference-based interaction with those for bigram-based interaction. Fig. 4 visualises the interaction time and the usability ratings for preference- and bigram-based interaction as notched boxplots. Both plots confirm the clear difference between preference- and bigram-based feedback for summaries: We measure an average interaction time of 102 s (with standard error SE = 4 s) for annotating bigrams in a single summary, which is over twice the time spent for providing a preference for a summary pair (43 s, SE = 3 s). The users identified 7.2 bigrams per summary, which took 14 s per bigram on average. As for the usability ratings, providing preferences is rated 3.8 (SE = 0.27) on average (median at 4), while labelling bigrams is rated 2.4 (SE = 0.22) on average (median at 2). These results suggest that humans can more easily provide preferences over summaries than point-based feedback in the form of bigrams.
Reliability assessment. To evaluate hypothesis H2, we measure the reliability of the users' preferences, i.e. the percentage of the pairs in which the user's preference is the same as the preference induced by U*. Figure 5 shows the reliability scores for the varying utility gaps employed in our study. The results clearly suggest that, for summary pairs with wider utility gaps, the participants can more easily identify the better summary in the pair, resulting in higher reliability. This observation supports the wider-gap-less-noise assumption.
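As a minimal sketch of the reliability measure, the following Python function computes the fraction of pairs in which a user's choice agrees with the preference induced by the utility scores. The tuple layout and the name `reliability` are assumptions made for illustration; the example values are invented and are not data from the study.

```python
def reliability(pairs):
    """Fraction of pairs where the user's choice matches the utility-induced one.

    `pairs` is a list of (u_first, u_second, user_chose_first) tuples, where
    u_first and u_second are the utilities of the two presented summaries.
    Pairs with equal utilities are counted as "second is better" here, which
    is an arbitrary convention for this sketch.
    """
    agreements = 0
    for u_first, u_second, user_chose_first in pairs:
        gold_prefers_first = u_first > u_second
        if gold_prefers_first == user_chose_first:
            agreements += 1
    return agreements / len(pairs)


# Invented example: the user agrees with the utility ordering in 3 of 4 pairs.
example = [(5.0, 4.0, True), (3.0, 4.0, True), (2.0, 7.0, False), (6.0, 1.0, True)]
print(reliability(example))  # 0.75
```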
Realistic user simulation. We observe that the shape of the reliability curves in Figure 5 is similar to that of the logistic function: when ∆ approaches 0, the reliability score approaches 0.5, and as ∆ increases, the reliability asymptotically approaches 1. Hence, we adopt the logistic model proposed by Viappiani and Boutilier (2010) to estimate the real users' preferences. We term the model logistic noise oracle (LNO): For two summaries y_i, y_j ∈ Y_x, we assume the probability that a user prefers y_i over y_j is:

$$P_x(y_i \succ y_j; m) = \left(1 + \exp\left[\frac{U^*_x(y_j) - U^*_x(y_i)}{m}\right]\right)^{-1},$$

where m is a real-valued parameter controlling the "flatness" of the curve: a higher m yields a flatter curve, which in turn suggests that asking users to distinguish summaries with similar quality causes high noise.
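The LNO can be sketched as follows, assuming the standard logistic form in which the preference probability depends only on the utility difference divided by m. The function names and the sampling helper are hypothetical; the default m = 2.14 is the value the study reports.

```python
import math
import random


def lno_prob(u_i: float, u_j: float, m: float = 2.14) -> float:
    """Probability that the simulated user prefers summary i over summary j."""
    return 1.0 / (1.0 + math.exp((u_j - u_i) / m))


def simulate_preference(u_i: float, u_j: float, m: float = 2.14, rng=None) -> bool:
    """Sample one noisy preference; True means summary i is preferred."""
    rng = rng or random
    return rng.random() < lno_prob(u_i, u_j, m)
```

With equal utilities the simulated user chooses at random (probability 0.5), and with a gap of 5 the better summary is chosen with probability of about 0.91.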
We estimate m based on the observations we made in the user study by maximising the likelihood function:

$$L_{LNO}(m) = \sum_u \sum_i \left[ p_u(y_{i,1}, y_{i,2}) \log P_x(y_{i,1} \succ y_{i,2}; m) + p_u(y_{i,2}, y_{i,1}) \log P_x(y_{i,2} \succ y_{i,1}; m) \right],$$

where u ranges over all users and i ranges over the preferences provided by each user. y_{i,1} and y_{i,2} are the summaries presented to the user in round i. p_u is the user's preference direction function: p_u(y_{i,1}, y_{i,2}) equals 1 if y_{i,1} is preferred by the user over y_{i,2}, and equals 0 otherwise. By letting ∂L_LNO(m)/∂m = 0, we obtain m = 2.14. The green curve in Figure 5 is the reliability curve for the LNO with m = 2.14. We find that it fits well with the reliability curves of the real users. As a concrete example, consider the summary pairs in Figure 3: LNO prefers Summ A,2 over Summ A,1 with probability .618 and prefers Summ B,1 over Summ B,2 with probability .914, which is consistent with our observations that 7 out of 12 users prefer Summ A,2 over Summ A,1, while all users prefer Summ B,1 over Summ B,2.
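The estimation of m can be illustrated with a small maximum-likelihood sketch. Note the swapped-in technique: the text above solves for the stationary point of the likelihood, whereas this sketch simply evaluates the log-likelihood on a grid of candidate values and picks the best one. The observation format, function names and grid are assumptions for illustration, not the procedure used in the study.

```python
import math


def log_prob_first(gap: float, m: float) -> float:
    """log P(first summary preferred), where gap = U*(first) - U*(second).

    This is log sigmoid(gap / m), written in a numerically stable way.
    """
    z = gap / m
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_likelihood(observations, m: float) -> float:
    """Sum of log-probabilities of the observed choices under the LNO.

    `observations` is a list of (gap, user_chose_first) tuples.
    """
    total = 0.0
    for gap, chose_first in observations:
        # Preferring the second summary has probability sigmoid(-gap / m).
        total += log_prob_first(gap if chose_first else -gap, m)
    return total


def fit_m(observations, grid=None) -> float:
    """Return the grid value of m with the highest log-likelihood."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 401)]  # candidates 0.05 .. 20.0
    return max(grid, key=lambda m: log_likelihood(observations, m))
```

When every choice agrees with the utility ordering, the fit drifts to the smallest candidate (a noise-free oracle); when choices are split evenly, it drifts to the largest candidate (a flat curve).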

Simulation Experiments
In this section, we study APRIL in a simulation setup. We use the LNO-based user model with m = 2.14 to simulate user preferences as introduced in §5.
We separately study the first and the second stage of APRIL, by comparing multiple active learning and RL techniques in each stage. Then, we combine the best-performing strategy from each stage to build the overall APRIL pipeline and compare our method with SPPI. We perform our experiments on three multi-document summarisation benchmark datasets from the Document Understanding Conferences (DUC): DUC'01, DUC'02 and DUC'04.
Table 2 shows the main properties of these datasets. For ease of reading, we summarise the parameters we used in our simulation experiments in Table 3.

Table 2: For our experiments, we use standard benchmark datasets from the Document Understanding Conference (DUC). # Doc: the overall number of documents across all topics. SumLen: the length of each summary (in tokens). # Sent/Topic: average number of sentences in a topic.

APL Strategy Comparison
We compare our AL-based querying strategy introduced in §4.1 (see Eq. (9)) with three baseline AL strategies:
- Random: In each interaction round, select a new candidate summary from D_S uniformly at random and ask the user to compare it to the old one from the previous interaction round. In the first round, we randomly select two summaries to present.
- J&N: The robust query selection algorithm proposed by Jamieson and Nowak (2011). It assumes that the items' preferences are dependent on their distances to an unknown reference point in the embedding space: the farther an item is from the reference point, the more preferred the item is. After each round of interaction, the algorithm uses all collected preferences to locate the area in which the reference point may fall and identifies the query pairs which can reduce the size of this area, termed ambiguous query pairs. To combat noise in preferences, the algorithm selects the most-likely-correct ambiguous pair to query the oracle in each round.
- Gibbs: This is the querying strategy used in SPPI. In each round, it selects summaries y_i, y_j with the Gibbs distribution (Eq. (6)), and updates the weights for the utility function as in line 5 in Alg. 1.

Parameter: Description
For APL (stage 1 in APRIL); see Alg. 2:
N = 10, 50, 100: query round budget
|D_S(x)| = 5000: summary DB size for each cluster x (see §4.1)
h: heuristics-based prior reward (see §4.1 and Eq. (8)); we use the reward heuristics proposed by Ryang and Abekawa (2012)
β = 0.5: trade-off between prior and posterior rewards (see Eq. (8))
α = 10^−3: learning rate for preference learning
φ(y, x): vectorised representation of summary y for document cluster x (see Eq. (8)); we use the same vector representation as Rioux et al. (2014)
Note that Gibbs presents two new summaries to the user each round, while the other querying strategies we consider present only one new summary per round (see §4.1). Thus, in N rounds of interaction with a user, the user needs to read 2N summaries with Gibbs, but only N + 1 with the other querying strategies.
To find the best weights w_g, w_d, w_e and w_u for our AL querying strategy in Eq. (9), we run grid search: We select each weight from {0, 0.2, 0.4, 0.6, 0.8, 1} and ensure that the sum of the four weights is 1.0. The query budget N was set to 10, 50 and 100. For each cluster x, we generated 5,000 extractive summaries to construct D_S(x). Each summary contains no more than 100 words, generated by randomly selecting sentences in the original documents in x. The prior h used in Eq. (8) is the reward function proposed by Ryang and Abekawa (2012), and we set the trade-off parameter β to 0.5. All querying strategies we test take less than 500 ms to decide the next summary pair to present.
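The grid of weight combinations can be enumerated as in the sketch below. It works with integer multiples of 0.2 to avoid floating-point drift when checking that the weights sum to 1; the function name `weight_grid` and the tuple order (w_g, w_d, w_e, w_u) are assumptions for illustration.

```python
from itertools import product


def weight_grid(steps: int = 5, n_weights: int = 4):
    """All weight tuples with entries in {0, 1/steps, ..., 1} that sum to 1."""
    combos = []
    for ks in product(range(steps + 1), repeat=n_weights):
        if sum(ks) == steps:  # integer check, so no rounding issues
            combos.append(tuple(k / steps for k in ks))
    return combos


grid = weight_grid()
print(len(grid))  # 56
```

The count of 56 matches the number of ways to distribute five units of 0.2 over four weights.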
The performance of the querying strategies is measured by the quality of their resulting reward function Û_x (see Eq. (8)). For each cluster x, we measure the quality of Û_x by its Spearman's rank correlation (Spearman 1904) to the gold-standard utility scores U*_x (Eq. (12)) over all summaries in D_S(x). We normalise Û_x to the same range as U*_x (i.e. [0, 10]). For the vector representation φ, we use the same 200-dimensional bag-of-bigram representation as Rioux et al. (2014).

Lower bound, N = 0, β = 0: .194
Table 4: Spearman's rank correlation between Û_x and U*_x, averaged over 20 independent runs on all clusters x in DUC'01. Û_x is learnt with different querying strategies. The lower bound is obtained by prohibiting all interactions (N = 0) and letting Û = h (i.e. β = 0 in Eq. (8)). Results marked with an asterisk are significantly better than all baselines.
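For readers who want to reproduce the evaluation measure, the following is a dependency-free sketch of Spearman's rank correlation: both score lists are converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is returned. The helper names are hypothetical, and the sketch assumes neither list is constant.

```python
def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(learnt_scores, gold_scores):
    """Spearman's rank correlation between learnt and gold-standard scores."""
    return pearson(average_ranks(learnt_scores), average_ranks(gold_scores))
```

Because only ranks matter, rescaling the learnt rewards to [0, 10] does not change this correlation.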
Table 4 compares the performance of different querying strategies. We find that all querying strategies outperform the zero-interaction lower bound even with 10 rounds of interaction, suggesting that even collecting a small number of preferences can help to improve the quality of Û_x. Among all baseline strategies, Gibbs significantly outperforms the other two, and we believe the reason is that Gibbs exploits the wider-gap-less-noise assumption (see §5). Of all 56 possible AL weight combinations, 48 combinations outperform the random and J&N baselines, and 27 outperform Gibbs. This shows the overall strength of our AL-based strategy. The best combination of the weights is w_g = 0, w_d = 0.6 and w_e = w_u = 0.2, closely followed by using the diversity-based strategy div alone (i.e. w_d = 1). We believe the reason behind the effectiveness of the div strategy is that it not only exploits the wider-gap-less-noise assumption by querying dissimilar summaries, but also explores summaries from different areas in the embedding space, which helps generalisation. Due to its simplicity, we use w_d = 1 henceforth, since its performance is almost identical to, and not statistically significantly different from, the best combination.

RL Comparison
We compare NTD (Alg. 3) to two baselines: TD (Sutton 1984) and LSTD (Boyan 1999). TD has been successfully used by Ryang and Abekawa (2012) and Rioux et al. (2014) for EMDS. LSTD improves TD by using least-squares optimisation, and it has been shown to perform better than TD in large-scale problems (Lagoudakis and Parr 2003). Note that both TD and LSTD use linear models to approximate the V-values. We use the following settings, which yield good performance in pilot experiments: learning episode budget T = 3000 and learning step α = .001 in TD and NTD. For NTD, the input of the V-value network is the same 200-dimensional draft summary representation as in Rioux et al. (2014); after the input layer, we add a fully connected ReLU (Glorot et al. 2011) layer with dimension 100 as the first hidden layer; an identical fully connected 100-dimensional ReLU layer follows as the second hidden layer; finally, a linear output layer outputs the V-value. Fig. 6 illustrates the structure of the V-value network. We use Adam (Kingma and Ba 2014) as the gradient optimiser (line 10 in Alg. 3), with the default setup. For LSTD, we initialise its square matrix as a diagonal matrix and let the diagonal elements be random numbers between 0 and 1, as suggested by Lagoudakis and Parr (2003).
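The shape of the V-value network (200 inputs, two ReLU layers of width 100, one linear output) can be illustrated with a forward pass in plain Python. This is only a structural sketch: the weight initialisation is an assumption not stated in the text, the function names are hypothetical, and the TD training loop with Adam is omitted.

```python
import random


def init_layer(n_in: int, n_out: int, rng: random.Random):
    """Create (weights, biases) for one fully connected layer.

    The Gaussian scale sqrt(2 / n_in) is an assumed initialisation for this
    sketch; the paper does not specify one.
    """
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu: bool):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        out.append(max(0.0, z) if relu else z)
    return out


def build_v_network(seed: int = 0):
    """200 -> 100 (ReLU) -> 100 (ReLU) -> 1 (linear)."""
    rng = random.Random(seed)
    return [init_layer(200, 100, rng), init_layer(100, 100, rng), init_layer(100, 1, rng)]


def v_value(network, features):
    """Estimate the V-value of a draft summary from its 200-dimensional vector."""
    hidden = dense(features, network[0], relu=True)
    hidden = dense(hidden, network[1], relu=True)
    return dense(hidden, network[2], relu=False)[0]
```

In practice one would use a deep learning library for this network so that gradients and the Adam update come for free; the plain-Python version is meant only to make the layer structure explicit.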
The rewards we use are based on U* defined in Eq. (12). Note that this serves as the upper-bound performance, because U* is the gold-standard scoring function, which is not accessible in the interactive settings (see §6.3 and §7). We measure the performance of the three RL algorithms by the quality of their generated summaries in terms of multiple ROUGE scores. Results are presented in Table 5. NTD outperforms the other two RL algorithms by a large margin. We assume that this is attributed to its more precise approximation of the V-values using the neural network.
In terms of computation time, TD takes around 30 seconds to finish the 3,000 episodes of training and produce a summary, NTD takes around 2 minutes, while LSTD takes around 5 minutes. Since the RL computation is performed only once after all online interaction has finished, we find this computation time acceptable. However, without using D_S as the memory replay, NTD takes around 10 minutes to run 3,000 episodes of training.

Full System Performance
We compare SPPI with two variants of APRIL: APRIL-TD and APRIL-NTD, which use TD and NTD, respectively. Both implementations of APRIL use the diversity-based AL strategy (i.e. div = 1.0). All other parameter values are the same as those described in §6.1 and §6.2 (see Table 3).
Results on DUC'01 are presented in Table 6. When no interaction is allowed (i.e. N = 0, Û = h), we find that the performance of the three algorithms shows no significant differences. With the increase of N, the gap between both APRIL implementations and SPPI becomes larger, suggesting the advantage of APRIL over SPPI. Also note that, when N = 0 and N = 10, APRIL-NTD does not have a significant advantage over APRIL-TD, but when N ≥ 50, APRIL-NTD significantly outperforms APRIL-TD in terms of ROUGE-1 and ROUGE-L. This is because when N is small, the learnt reward function Û contains much noise (i.e. has low correlation with U*; see Table 4) and the poor quality of Û limits the advantage of the NTD algorithm. The problem is alleviated as N increases. The above observations also apply to the experiments on DUC'02 and DUC'04; their results are presented in Tables 7 and 8, respectively.

We attribute the superior performance of APRIL to two factors: (i) noise robustness: SPPI purely relies on the collected preferences to improve its policy (see Alg. 1), while our reward estimation considers both collected preferences and heuristics to mitigate the noise in the preferences (see Eq. (8)). (ii) more rounds of training: our method can update its RL policy for as many episodes as we want (T ≫ N), while SPPI can only update its policy for up to N rounds (see Alg. 1). This property enables APRIL to thoroughly exploit the information from the user preferences. Fig. 7 illustrates that, with the same reward, the quality of the APRIL-generated summary grows with the increase of T. This is because more episodes of learning reduce the error between the RL-generated policy and the optimal policy (Bertsekas and Tsitsiklis 1996). But the computation time also increases with the growth of T: every 1000 episodes cost around 10 seconds of training time for TD, and around 3 minutes for NTD. We hence let T = 3000 (see Table 3) in our experiments to trade off between performance and real-time responsiveness.

Human Evaluation
Finally, we invite real users to evaluate the performance of APRIL in two experiments: First, we test whether APRIL can improve the no-interaction baseline after a few rounds of interaction with the user. Second, we test whether APRIL outperforms SPPI given the same query budget. To conduct the experiments, we develop a web interface that accommodates both SPPI and APRIL. Three document clusters are randomly selected from the DUC datasets (d30046t, d100e and d068f). Seven users from our university participate in our experiments; all of them are native or fluent English speakers. Due to the similar performance of APRIL-TD and APRIL-NTD with N = 10 interaction rounds (see Tables 6, 7 and 8), we use APRIL-TD throughout our experiments, because of its faster computation time (see §6.2).

APRIL vs. No-Interaction
Following the setup of our user study introduced in §5, we first allow the users to understand the background of the topic with two 200-word abstracts. Then, we ask them to interact with APRIL for N = 10 rounds and finally present both the no-interaction summary (i.e. β = 0 in Eq. (8)) and the with-interaction summary using APRIL-TD (β = 0.5) to the users to ask for their final preference. Note that, because our AL strategy selects the summaries to present based on each user's previous preferences, the summaries read by each user during the interaction are different.
Table 9 presents the results. The column "Human" shows how often the participants prefer the with-interaction summary over the no-interaction summary. For all document clusters we have tested, the users clearly prefer the with-interaction summaries, suggesting that APRIL can produce better summaries with just 10 rounds of interaction. In addition, we find that with increasing utility gap ∆U* between the with- and no-interaction summaries, the with-interaction summaries are preferred by more users. The column "LNO" compares this finding with our LNO-based user simulation, whose probability also increases with ∆U*. This observation yields further evidence towards our wider-gap-less-noise assumption discussed in §5. Also, the Pearson correlation between the real users' and LNO's preference ratio (i.e. the second and third column in Table 9) is .953 with p-value .197, which supports the validity of the LNO-based model, although with only three clusters the correlation is not statistically significant. However, we also note that in topic d068f, the with-interaction summaries' average U* is lower than that of the no-interaction ones, but still more than 50% of the users prefer the with-interaction summaries. We believe that this is due to the mismatch between the ROUGE-based U* and users' real judgement of the summaries.

APRIL vs. SPPI
We invite seven users to judge the quality of SPPI and APRIL summaries in the following way: We use six randomly selected APRIL-generated with-interaction summaries (two per document cluster) from the first experiment (§7.1) and pair them with six new SPPI-generated summaries on the same clusters. To generate the SPPI summaries, we ask two additional users to interact with SPPI for ten interaction rounds on the same three document clusters and in the same manner as in the first experiment. Then, we ask the seven users of the actual study to provide a preference judgement towards the better summary of each pair and additionally rate the quality of each summary on a 5-point Likert scale (higher score means higher quality). Some of the summaries presented to the users in this user study are shown in Fig. 8. Note that in all previous work we are aware of (P.V.S. and Meyer 2017; Kreutzer et al. 2017; Gao et al. 2018), the evaluation was based on simulations with a perfect user oracle. Therefore, we expect that our results with real user interaction better reflect performance in practice.

Table 9: Most users prefer the with-interaction summaries over the no-interaction summaries. Human: the percentage of pairs in which users prefer the with-interaction over the no-interaction summaries. LNO: the percentage of pairs in which the LNO simulated user prefers with-interaction. ∆U*_x: the average improvement of the with-interaction summaries over the no-interaction summaries in terms of U*_x.
Table 10 presents the results. In two out of three clusters, the APRIL-generated summaries are clearly preferred by the users and receive higher ratings. The exception is cluster d30046t, where users equally prefer the SPPI- and APRIL-generated summaries and give them similar ratings. By looking into these summaries (see the top row in Fig. 8), we find that both summaries grasp the main idea of the document cluster (Checkpoint Charlie is removed with a ceremony, attended by diplomats from Germany and World War II allies), but also include some less important information (e.g. "I ran as fast as I could" in the summary by SPPI, and "This is a nice way ..." in the one by APRIL). The top row of Table 9 suggests that users overwhelmingly prefer APRIL-generated summaries over no-interaction summaries for this cluster d30046t. This suggests that both interactive approaches generate summaries of similarly high quality for this cluster. As the cluster is about a single short-term event, we speculate that both interactive approaches can easily grasp the users' needs for such events and produce equally good summaries. However, for more complex clusters that are about multiple events (e.g. d100e, which covers a series of events involving multiple politicians) or about events that happened across a long time range (e.g. d068f, which covers events before and after the death of John Lennon), APRIL can more precisely grasp the needs of the users and hence generate better summaries than SPPI.
To summarise, given the same query budget N, APRIL generates summaries of comparable or superior quality compared to SPPI, while its reading load is almost half that of SPPI (APRIL requires the users to read N + 1 summaries, while SPPI requires 2N; see §4.1). Also, we find a high correlation between the real users' and the LNO's preference ratio (Pearson correlation .974 with p-value .145), which again supports the validity of LNO.

Cluster d30046t, SPPI: The famed Allied checkpoint by the Berlin Wall was closed with an elaborate ceremony that brought together the top diplomats from the Germanys and the four World War II Allies. Maik Polster was a stern-faced member of the East German secret police. Checkpoint Charlie, the famed Allied border crossing by the Berlin Wall, was to be hauled away Friday. "And now, 29 years after it was built, we meet here today to dismantle it and to bury the conflicts it created." It was part of my home." "I ran as fast as I could," he said. U.S. Army Sgt.

Cluster d30046t, APRIL: The famed Allied checkpoint by the Berlin Wall was closed with an elaborate ceremony that brought together the top diplomats from the Germanys and the four World War II Allies. "This is a nice way to end my military service to be here when they take it down," said Walsh, 23, a military police officer who leaves the army in six weeks to study for the priesthood. Checkpoint Charlie, the famed Allied border crossing by the Berlin Wall, was to be hauled away Friday. East Germany's border guards were as feared as members of the secret police. U.S. Army Sgt.

Cluster d100e, SPPI: The smart money argues that the Senate could not muster the 67 votes that would be needed to remove the wounded president from office, which would require the defection of 12 Democrats if all the Republicans stand against him. In an incredibly unseemly display, Trent Lott, the majority leader, and former Bush national security adviser Brent Scowcroft and Bush Secretary of State Lawrence Eagleburger chimed in on the attack. Rep. Thomas Barrett, a Democrat from Wisconsin, tried to remind his Republican colleagues that the Constitution "does not allow you to remove a president from office because you can't stand him."

Cluster d100e, APRIL: Bob Livingston, the incoming speaker of the House, took no public role Friday as the debate unfolded on whether to impeach President Clinton. "We 're losing track of distinction between sins and crimes," said Rep. Jerrold Nadler, D-N.Y. "We 're lowering the standards of impeachment. But at the White House, where calls for Clinton's resignation are derided as a Republican strategy, the president sent a spokesman into the driveway to urge Livingston to reconsider his resignation. It has gotten to the point where drastic action may be necessary. The only thing certain now is uncertainty. You resign !"

Cluster d068f, SPPI: Say what you want about Albert Goldman, the author of the new biography, "The Lives of John Lennon" (Morrow, $22.95), but you 've got to hand it to him : This guy is one ambitious sleazemonger. John Lennon's worldwide message of peace was delivered Tuesday as his song "Imagine" was played simultaneously for 1 billion people in 130 countries to celebrate what would have been his 50th birthday. The image of a dour, shoeless English boy and his absent, carefree mother prompted Julia Baird and Geoffrey Giuliano to collaborate on a book. "I believe in fairies, the myths, dragons. Surprised?

Cluster d068f, APRIL: John Lennon's worldwide message of peace was delivered Tuesday as his song "Imagine" was played simultaneously for 1 billion people in 130 countries to celebrate what would have been his 50th birthday. The image of a dour, shoeless English boy and his absent, carefree mother prompted Julia Baird and Geoffrey Giuliano to collaborate on a book. Cynthia Lennon joins the throng denouncing the new, unauthorized biography of her late former husband, John Lennon, as written by a money-hungry author capitalizing on untruths. "I believe in fairies, the myths, dragons. Lennon's widow, Yoko Ono, asked the crowd. Happy birthday, John.

Fig. 8: Summaries generated by SPPI and APRIL after 10 rounds of interaction with real users.

Conclusion
In this work, we propose a preference-based interactive document summarisation framework, which interactively learns to generate improved summaries based on user preferences. We focused on two research questions in this work: (i) can users easily provide reliable preferences over summaries, and (ii) how to mitigate the high sample complexity problem. For question (i), we showed in a user study that users are more likely to provide reliable preferences when the quality gap between the presented summaries is large, and that users find it easier to provide preferences than other forms of feedback (e.g. bigrams). For question (ii), we proposed the APRIL framework, which splits the reward learning and the summary searching stage. This split allows APRIL to more efficiently query the user and more thoroughly exploit the collected preferences by using active preference learning algorithms, and to more effectively search for the optimal summary by using reinforcement learning algorithms. Both our simulation and real-user experiments suggested that, with only a few (e.g. ten) rounds of interaction, APRIL can generate better summaries than the non-interactive RL-based summariser and the SPPI-based interactive summariser. APRIL has the potential to be applied to a wide range of other NLP tasks such as machine translation and semantic parsing.

Table 10 (continued): Q_m: the average rating of the summaries generated by method m. Asterisk: significant advantage of APRIL over SPPI.

Fig. 1: SPPI (a) directly uses the collected preferences to "teach" its summary generator, while APRIL (b) learns a reward function as the proxy of the user/oracle, and uses the learnt reward to "teach" the RL-based summariser.

Fig. 2: Detailed workflow of APRIL (extended version of the workflow presented in Fig. 1b).

Fig. 4: Comparison of interaction time and usability ratings for preference- and bigram-based interaction. Notches indicate the 95% confidence interval.

Fig. 5: The reliability of users' preferences increases with the growth of the utility gaps between presented summaries. Error bars indicate standard errors.

Fig. 6: The structure of the V-value network used in NTD.

Fig. 7: With the same rewards (Û with N = 300) and the same RL algorithm (TD), the quality of the generated summary (in terms of ROUGE-1) increases with the growth of the episode budget. All results are averaged over all document clusters in DUC'01. Error bars indicate standard errors.

Table 3: Overview of the parameters used in simulation experiments.

Table 5: NTD outperforms the other TD algorithms across all DUC datasets. All results are averaged over 10 independent runs across all topics in each dataset. Asterisk: significant advantage.

Table 7: Results with N rounds of interaction with the LNO-based simulated user in DUC'02. All results are averaged over all topics in DUC'02. Asterisk: significantly outperforms SPPI. Dagger: significantly outperforms both SPPI and APRIL-TD.

Table 8: Results with N rounds of interaction with the LNO-based simulated user in DUC'04. All results are averaged over all topics in DUC'04. Asterisk: significantly outperforms SPPI. Dagger: significantly outperforms both SPPI and APRIL-TD.

Table 10: Most users prefer the summaries generated by APRIL over those by SPPI. Human: the percentage of pairs in which users prefer APRIL over SPPI. LNO: the percentage of pairs in which the LNO-based simulated user prefers APRIL over SPPI.