
1 Introduction

Automatic text summarisation is described as the reduction of a text to its important content [1]. There has been significant progress in the area in the last 50 years. However, this task still poses many challenges to the scientific community. Sports news summarisation is one application of such techniques, which focuses on generating rich and concise text summaries that capture the essence of sports-related articles. There are two key methods used in text summarisation, namely extraction and abstraction. The former strategy selects sentences and phrases from the original text as a summary, while the latter strategy can potentially involve the generation of new relevant content to use in a summary.

This paper looks at two main problems in the field of automatic text summarisation. Firstly, are abstractive or extractive algorithms more appropriate for summarisation in particular domains of interest? Secondly, how do we effectively evaluate the summaries generated by different algorithms? There exist a number of automatic evaluation metrics that are widely applied in this area [2, 3]. However, these are sometimes not sufficient on their own. This paper proposes a new hybrid evaluation system that uses both automatic evaluation metrics alongside human evaluation.

Through investigating the above problems, this paper aims to achieve a number of goals. Firstly, we aim to identify the best algorithms and compare their performance across different article domains, in order to examine whether an algorithm’s performance is domain-dependent. This will enable researchers to determine which summarisation algorithm to consider, depending on the domain of the corpus they want to summarise. Secondly, we explore what effect different dialects might have on the performance of the summarisation algorithms. Finally, in terms of evaluation, we aim to develop a hybrid system to evaluate the performance of the algorithms when summarising sports news and to discover whether automatic evaluation metrics correlate well with human judgements.

2 Background Research

Interest in the area of text summarisation started to increase in the 1950s due to increases in news production and consumption. The first research paper to address this task was published at IBM in 1958 and targeted the automatic creation of literature abstracts [4]. The paper set multiple ground rules that are now fundamental steps in most summarisation tasks. These include reverting words to their stems, removing stop words, and building some form of a table of words with corresponding significance scores.

Extractive text summarisation techniques determine which words, phrases, and/or sentences from the original text are incorporated into a short summary [5]. In this case no new content is created by the algorithm. Instead, the entire summary is derived from the original author’s words. One of the most popular extractive algorithms is TextRank [6]. Based on Google’s PageRank algorithm, it is an unsupervised technique that extracts keywords and sentences from text documents to use as a summary. A graph is used where sentences are vertices and the weight of the edges depends on a given sentence similarity function. Extractive methods are mainly concerned with the content of the summary, rather than how it reads and flows.
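To make the graph-based idea concrete, the following is a minimal TextRank-style sketch (not the original implementation) that ranks sentences with PageRank over a similarity graph. It assumes nltk (with the punkt tokeniser data) and networkx are available, and uses a simple word-overlap similarity in place of the normalised measure proposed in [6].

# Minimal TextRank-style sentence ranking (illustrative sketch, not the original algorithm).
import itertools
import networkx as nx
from nltk.tokenize import sent_tokenize, word_tokenize  # requires the nltk 'punkt' data

def textrank_summary(text, n_sentences=2):
    sentences = sent_tokenize(text)
    tokens = [set(word_tokenize(s.lower())) for s in sentences]

    # Vertices are sentences; edge weights come from a (here: word-overlap) similarity function.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(tokens[i] & tokens[j])
        if overlap:
            graph.add_edge(i, j, weight=overlap)

    # Rank sentences with PageRank and return the top-scoring ones in document order.
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)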

Abstractive text summarisation creates an abridgement using vocabulary that may go beyond the original text, which is then assembled into a summary of the original document [5]. These algorithms generate a concise summary that captures the essence and primary ideas of the original document, but it can also contain new phrases that were not in the original text. There is a great emphasis on the form of the summary and its grammatical structure. A popular example of an abstractive summarisation technique is BART [7]. This technique includes a pre-training approach: the model is trained by first corrupting text with an arbitrary noising function and then learning to reconstruct the original text.

In order to compare the performance of different summarisation algorithms, appropriate evaluation techniques are required. Over time, evaluation approaches have ranged from manual inspection and comparison with a human-written summary to more complex scoring measures, such as ROUGE and BLEU. It is also possible to construct new evaluation techniques, as was done in [5], depending on the use case of the system. The authors of that paper proposed an evaluation technique called PolyTope, an error-oriented, fine-grained human evaluation method consisting of 8 elements or issue types. Work in this area is ongoing, and choosing the right metric is often subjective and domain-dependent.

There has been very limited work in the specific area of sports news summarisation. All of the above extractive and abstractive techniques can be applied to sports news articles, as they are simply documents of unstructured text. However, it is interesting to see which algorithms perform better on sports articles and whether algorithm performance is affected by the news domain. Initial work in [8] discusses the summarisation of live sports commentary from Chinese news outlets. The authors suggested that a new algorithm is needed to deal with the different writing styles that arise from different languages and cultures. The proposed algorithm, SportsSum, outperformed various baseline methods.

Based on a review of the literature, we are left with two main considerations. Firstly, there are no current papers that discuss and compare the different algorithms across both extractive and abstractive summarisation in the domain of sports news. Secondly, there is currently no user interface that helps users to pick which summarisation algorithm to use, based on pre-defined text input examples. In this work we try to address both these limitations.

3 Data Considerations

The data set chosen for use in this paper is the benchmark CNN/DM news collection [9]. This corpus is labelled, which means the news articles are associated with highlights. These highlights are short human-written summaries that can be used in the evaluation of automatically-generated summaries. A version of the data which was pre-processed into clean CSV files was used (Footnote 1). For the purpose of this comparative study, the data was combined into a single representation, consisting of 311,971 articles in total.

Following that, these articles needed to be split into two groups: (i) US CNN articles; (ii) UK Daily Mail (DM) articles. This was done by filtering for articles that start with ‘(CNN)’, which excludes CNN articles from other countries and DM articles. DM articles were identified by finding articles that start with ‘By . Daily Mail’. This resulted in two sets, one containing 50,335 CNN articles and the other containing 149,884 Daily Mail articles.

In addition, the generalisation of summarisation algorithms across two different domains was explored. The set from each news outlet was filtered to find only sports and politics articles, both identified using manual sets of keywords. An article needed to contain two or more keywords to be categorised as part of that domain. This resulted in 661 CNN sports and 351 politics articles, while the Daily Mail (DM) provided 2,106 sports and 682 politics articles. These are treated as four separate sets for later experimentation.
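A minimal sketch of how the outlet split and the keyword-based domain filter described above could be reproduced with pandas is shown below; the CSV file name, the column name article, and the keyword list are illustrative assumptions rather than the exact ones used.

# Illustrative sketch of the outlet split and keyword-based domain filtering.
import pandas as pd

df = pd.read_csv("cnn_dm_combined.csv")  # assumed name of the pre-processed CSV

cnn = df[df["article"].str.startswith("(CNN)")]
dm = df[df["article"].str.startswith("By . Daily Mail")]

SPORTS_KEYWORDS = {"match", "league", "goal", "tournament", "season", "coach"}  # placeholder keywords

def in_domain(text, keywords, min_hits=2):
    # An article is assigned to a domain if it contains two or more of its keywords.
    words = set(text.lower().split())
    return sum(keyword in words for keyword in keywords) >= min_hits

cnn_sports = cnn[cnn["article"].apply(lambda t: in_domain(t, SPORTS_KEYWORDS))]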

4 Implementation

We conduct a comparative study to identify the best algorithms and compare their performance in different domains. We also explore the difference in performance depending on the English dialect used (British English vs American English). A human evaluation was also performed to support this comparison and determine whether automatic evaluation agrees with human judgement.

4.1 Comparative Study

This comparative experiment was implemented using Python 3, based upon the libraries sumy [10] and transformers [11]. The purpose of this study is to generate multiple summaries for every article in order to compare the performance of each of the algorithms. We focus on four popular algorithms. The two extractive ones are LexRank and LSA, while the abstractive ones are BART and Pegasus. The abstractive algorithms require training. To robustly evaluate their performance, the two abstractive algorithms are run twice, once using a model pre-trained on the CNN/DM data set and once using a model trained on a different data set. The second data set allows their performance on completely unseen data to be evaluated. BART1 and Pegasus1 are trained on the CNN/DM data. On the other hand, BART2 uses the base BART model, which is pre-trained on general English text with no summarisation fine-tuning [12]. Pegasus2 is trained on the WikiHow data set, a collection of plain-English how-to articles and instructions [13].
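To make the six configurations concrete, one possible mapping of run names to implementations is sketched below. The Hugging Face checkpoint names are assumptions for illustration and may differ from the models actually used.

# Possible mapping of the six runs to implementations (checkpoint names are assumptions).
SUMMARISER_CONFIGS = {
    "LexRank":  {"library": "sumy", "summariser": "LexRankSummarizer"},
    "LSA":      {"library": "sumy", "summariser": "LsaSummarizer"},
    "BART1":    {"library": "transformers", "model": "facebook/bart-large-cnn"},       # fine-tuned on CNN/DM
    "BART2":    {"library": "transformers", "model": "facebook/bart-large"},           # base model, no summarisation fine-tuning
    "Pegasus1": {"library": "transformers", "model": "google/pegasus-cnn_dailymail"},  # fine-tuned on CNN/DM
    "Pegasus2": {"library": "transformers", "model": "google/pegasus-wikihow"},        # fine-tuned on WikiHow
}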

LexRank is an unsupervised graph-based approach to automatic text summarisation. The paper that introduced it frames sentence extraction as finding the most central sentences, i.e. those that convey a sufficient amount of the document’s information [14]. The articles were tokenised and pre-processed before being passed to the algorithm. The number of sentences required in the summary was set to 2, after analysing the human-written summaries in the CNN/DM data set.

Latent Semantic Analysis (LSA) is a linear-algebra-based method that extracts hidden semantic structures of words and sentences, i.e. latent features that are not explicitly mentioned in the text. These features are essential to the data but are not among its original surface features. It is an unsupervised approach and was implemented in a similar way to LexRank.
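Both extractive runs can be reproduced with sumy roughly as follows; this is a sketch assuming the default English tokeniser and the two-sentence summary length described above.

# Sketch of the extractive runs using sumy (LexRank shown; LSA is selected analogously).
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer

def extractive_summary(text, method="lexrank", sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer() if method == "lexrank" else LsaSummarizer()
    sentences = summarizer(parser.document, sentence_count)
    return " ".join(str(sentence) for sentence in sentences)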

BART is trained by firstly corrupting text using an arbitrary noising function, and then learning a model to reconstruct the original text. It is then fine-tuned as a standard sequence-to-sequence model [7]. In the two BART runs, the summarisers are first initialised using the required models. The body of each article is then passed into these summarisers, along with a minimum and a maximum length for the summary; these were set to 30 and 130 words respectively, after testing the lengths of the summaries yielded. The final parameter passed relates to early stopping, which is used to avoid over-fitting as a way of regularising the summaries. In effect, this stops the beam search when a certain threshold of finished sentences is reached per batch.
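A minimal sketch of an abstractive run using the transformers summarisation pipeline, with the length and early-stopping settings described above, is given below. The checkpoint name is one plausible choice for BART1 and is an assumption; the other abstractive runs differ only in the model loaded.

# Sketch of an abstractive run with the transformers pipeline (BART1 shown).
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")  # assumed checkpoint

def abstractive_summary(text):
    # Length limits of 30 and 130 and early stopping, as described above;
    # truncation keeps long articles within the model's input limit.
    result = summariser(text, min_length=30, max_length=130,
                        early_stopping=True, truncation=True)
    return result[0]["summary_text"]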

Pegasus was developed by Google AI in 2020. It uses an encoder-decoder model for sequence-to-sequence learning [15]. Pegasus avoids the naive approach of training like an extractive summariser and instead makes full use of the large training corpus. It masks whole sentences and concatenates the gap-sentences into a pseudo-summary, which strengthens it as an abstractive algorithm [15].

4.2 Web Application

A new user interface was developed where the key objectives are: 1) to enable users to compare the performance of summarisation algorithms; 2) to allow users to select the topic of the article to be summarised; 3) to give users an overview of the algorithms and the automatic evaluation techniques to help them to understand what the results mean; 4) to provide a tool to conduct user studies to compare the performance of the summarisation algorithms for the purpose of this project. The application was developed in Python, using the Flask library.

Each of these design goals has a corresponding subset of stakeholders. The web application will be used by: individuals who would like to view an article alongside multiple summaries, for example to decide which summarisation algorithm to use in their own projects; users conducting deeper research into summarisation algorithms, who would like to see their overall performance before digging deeper into them using the links provided as a starting point; and researchers who would like to conduct user studies.

With regards to researchers who might conduct user studies, in the web interface they can choose whether they want articles from CNN or DM, and the topic of those articles. Once this is selected, the user is taken to a page with five randomly chosen articles, each accompanied by six summaries corresponding to the six algorithms. The first design consideration here was to remove the automatic evaluation metrics, to avoid any bias in the user study; otherwise, participants might tend to rate summaries with a high metric score more favourably. Such user studies usually require several participants to rate the same article, so it would be a bad idea to draw from the entire data set of hundreds of articles, as that would make it unlikely for two participants to rate the same article. Instead, a small pool of 20 articles per subsection was created. To avoid the risk of users learning a bias towards certain algorithms, the summaries were labelled with letters A to F (instead of algorithm names) and the order of presentation was randomised.
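The label anonymisation and ordering could be implemented roughly as below; the algorithm names and the structure of the summaries dictionary are illustrative assumptions.

# Sketch of anonymising and shuffling the six summaries for the user study.
import random
import string

ALGORITHMS = ["LexRank", "LSA", "BART1", "BART2", "Pegasus1", "Pegasus2"]

def anonymise(summaries):
    # Map algorithms to shuffled letters A-F so participants cannot learn a bias.
    order = random.sample(ALGORITHMS, k=len(ALGORITHMS))  # random presentation order
    labels = dict(zip(order, string.ascii_uppercase))     # e.g. {'BART1': 'A', ...}
    return [(labels[name], summaries[name]) for name in order]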

5 Evaluation

Evaluating the usefulness of the summary of an article can be a very subjective and challenging task. To this day there is no single perfect measure [16]. We now evaluate how close an algorithm-generated summary is to a human one, as well as its coherence, relevance and content.

5.1 Evaluation Metrics

Two key metrics have been used to automatically evaluate the summaries generated in this work.

The ROUGE measure has been used widely in many studies that evaluate text summarisation performance [3]. It is a recall-oriented evaluation that calculates the lexical overlap between the output generated by the system and a reference summary [5]. These reference summaries may introduce a bias, as they act as a baseline. Two common variations of the ROUGE score have been used:

  • ROUGE-1 looks for the overlap of single-word uni-grams when comparing the algorithm’s summary with the human-written reference summary. This is treated as a baseline.

  • ROUGE-2 looks for the overlap of two-word bi-grams when comparing the algorithm’s summary with the reference summary. This is a stricter measure than ROUGE-1 and will generally yield lower scores. It was chosen in order to assess the fluency of the summaries.

The other metric considered is the BLEU (Bilingual Evaluation Understudy) score, which measures precision: how many words in the algorithm-generated summary also appear in the reference summary. A perfect match results in a score of 1.0. Note that BLEU also computes a brevity penalty [2]. The two variations selected were as follows (a scoring sketch is given after the list):

  • BLEU-1: This implementation counts matching uni-grams, where each uni-gram is a single word, and acts as a baseline.

  • BLEU-2: N-grams from uni-grams up to 4-grams are assigned the weights (0.6, 0.25, 0.1, 0.05) respectively. This weight distribution was chosen so that the score reflects the overlap between the summaries, by giving uni-grams the highest weight, while also accounting for overall fluency through the descending weights given to longer n-grams.
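The four automatic scores can be computed with standard libraries. The sketch below uses the rouge-score and nltk packages with the BLEU-2 weights listed above; reporting the ROUGE F-measure (rather than recall alone) and applying smoothing are choices made for illustration.

# Sketch of the four automatic metrics using rouge-score and nltk.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def automatic_scores(reference, candidate):
    rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    r = rouge.score(reference, candidate)
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    smooth = SmoothingFunction().method1  # avoids zero scores on short summaries
    return {
        "ROUGE-1": r["rouge1"].fmeasure,
        "ROUGE-2": r["rouge2"].fmeasure,
        "BLEU-1": sentence_bleu([ref_tokens], cand_tokens, weights=(1, 0, 0, 0),
                                smoothing_function=smooth),
        "BLEU-2": sentence_bleu([ref_tokens], cand_tokens,
                                weights=(0.6, 0.25, 0.1, 0.05),
                                smoothing_function=smooth),
    }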

Basing the evaluation of a summary on a frequency-based metric can bias the score against abstractive techniques, since they change the words used to enhance the flow of the summary [17]. In addition, we are likely to want sports news summaries to be engaging, which might be difficult for an automated scoring system to assess. Hence, human evaluation will be used too.

5.2 Human Evaluation

Another type of intrinsic assessment is text quality evaluation. This can consider aspects of summaries ranging from the quality of the grammar to the level of content redundancy. Automated techniques still struggle to measure such human-subjective features [5, 17]. After analysing the attributes that have been considered in previous experiments, a list of attributes was compiled: Fluency, Relevance, Duplication, Inclusion, and Exclusion. The first two measures are best suited to a rating approach, while the latter three suit an error-spotting approach. Thus the proposed score is split into two parts: (i) a Coherence score, which covers fluency and relevance; (ii) an Error score, which covers the other attributes.

The web application described above was used to conduct our human evaluation. A pilot experiment was first run to fine-tune the evaluation process. This allowed us to identify an issue where the pilot participant started building a bias towards a certain letter choice (i.e., a particular algorithm). This was dealt with by randomising the algorithms’ order and the letter mapping. The participant also complained about the length of some of the articles. There is a trade-off between the variety of article lengths and participants’ attention span. After analysing the data set, it was decided to display only articles of 225 words or fewer, in order to maintain participants’ attention. In the pilot study the participant was asked not only to identify the number of errors but also to comment on their syntactic type. This was not effective, as the summaries are short and do not offer much variety for such detail. Comments on error types were therefore not requested in the final experiment.

The full user experiment included 10 participants, each assigned 10 sports articles split between the CNN and DM article sets. For each article, users had to rate the six summaries. This involved a rating out of 10 for the first two measures and the identification of errors for the latter three. Participants included equal numbers of male and female students, with varying proficiencies in English.

Score Calculation. Each evaluation attribute contributes a different percentage towards the overall score. The reasons for the different relative weights are listed in Table 1.

Table 1. Explanation of the weights and how they contribute to the overall score.
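Since the weight values from Table 1 are not reproduced in this text, the sketch below uses placeholder weights purely to illustrate how the attribute ratings could be combined into a single overall score.

# Illustrative weighted combination of the human-evaluation attributes.
# The weights below are placeholders; the actual values are those given in Table 1.
WEIGHTS = {"Fluency": 0.25, "Relevance": 0.25, "Duplication": 0.20,
           "Inclusion": 0.15, "Exclusion": 0.15}

def overall_score(ratings):
    # 'ratings' maps each attribute to a score normalised to [0, 1].
    return sum(WEIGHTS[attribute] * ratings[attribute] for attribute in WEIGHTS)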

Limitations. Only shorter articles from the CNN/DM data set were used, which limits the experiment somewhat, as it does not allow a large variation window in article length. In addition, a user’s performance could change over time, which causes their judgement to vary due to either a better understanding of the task or due to fatigue.

6 Results

The sports articles were passed to the six summarisation algorithms. The resulting summaries were passed to the four automatic evaluation measures. The results can be seen in Table 2. The human evaluation was conducted on 20 sports articles, 10 from CNN and 10 from DM. The experiment had 10 participants as explained in Sect. 5.2. The results of this experiment are reported in Table 3.

Table 2. Summary of performance for automatic evaluation metrics.

6.1 Analysis

A range of findings can be inferred from these results. To test the significance of these findings, a paired t-test is used. It is a parametric procedure and hence makes several assumptions, the main one being that the data are approximately normally distributed. To check this, the Shapiro-Wilk test was used. The significance level used is 0.05, which indicates a 5% risk of concluding that a difference exists when there is no actual difference. Overall, human evaluation metrics are on average significantly higher than the automatic metrics. This may be due to human-linked biases. Automatic metrics based on 1-gram overlaps score higher than those based on longer n-grams, because it is more likely for a single word to match the reference summary than for a sequence of words to do so.
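The normality check and the paired test can be run with scipy. The sketch below assumes that the two inputs are paired per-algorithm (or per-article) score lists.

# Sketch of the significance testing: Shapiro-Wilk for normality, then a paired t-test.
from scipy import stats

def compare(scores_a, scores_b, alpha=0.05):
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    _, p_normal = stats.shapiro(differences)             # normality of the paired differences
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return {"normal": p_normal > alpha, "t": t_stat, "p": p_value,
            "significant": p_value < alpha}

For one-tailed comparisons, such as the extractive versus abstractive test below, recent versions of scipy also accept an alternative argument in ttest_rel (e.g. alternative='less').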

Table 3. Summary of performance for human evaluation. (A high score indicates that the algorithm performed well.)

Extractive vs. Abstractive. From the results for the automated measures, it is apparent that abstractive techniques outperform extractive ones in the domain of sports articles. However, in the human evaluation metrics, there is a slight advantage for the extractive techniques. This was an unexpected result. The hypothesised reason is the two abstractive models that were not pre-trained on the CNN/DM data set: they brought the average down, and made a noticeable difference only in the human evaluation due to the small range within which the participants rated. Further analysis was therefore needed. A one-tailed paired t-test was conducted. There was a significant difference in the performance of extractive algorithms (M = 0.364, SD = 0.304) compared to abstractive algorithms (M = 0.433, SD = 0.257), t(5) = −2.677, p = 0.022. This means that the abstractive algorithms did indeed perform better. The one-tailed test is more appropriate than a two-tailed test in this situation, since the hypothesis of interest is whether extractive algorithms perform worse. Another t-test was conducted on the data, this time excluding the abstractive techniques that were not trained on CNN/DM. The test showed an even more significant difference.

CNN vs. DM. The CNN and DM article sets are written in American and British English respectively. Overall, the writing dialect of sports articles, whether American or British, does not affect the performance of the algorithms. However, the CNN summaries seem to outperform the DM summaries in the automatic metrics, while the opposite is true for the human metrics. This is likely due to a limitation in the human evaluation, as the participants are based in Ireland, which likely shares language usage more similar to the UK than to the US. The t-test failed to reject the null hypothesis, that is, the mean performance on CNN and DM articles is not significantly different.

Best Performing Algorithms. The algorithm that scores best in each evaluation measure has certain desirable features. Pegasus1, the Pegasus model pre-trained on the CNN/DM data set, scored best in six evaluation measures. Pegasus1 is the most effective summarisation algorithm for sports articles, followed by BART1. The latter loses out in the automatic evaluation metrics, which may be because BART1 rephrases the summary while keeping the same meaning; this is a limitation of such metrics. LSA is the worst performing algorithm in terms of ROUGE-1 and ROUGE-2 scores, which signifies that LSA summaries have poor recall. However, the worst performer in the remaining evaluation measures is Pegasus2. It is the same algorithm as Pegasus1 but is pre-trained on the WikiHow data set. Pegasus2 demonstrated low precision and coherence, with many errors. In general, the extractive algorithms performed poorly, alongside the abstractive algorithms that were trained on inappropriate data sets.

Automatic vs Human Evaluation. The human evaluation results are similar across the six algorithms. Humans tend not to be extreme and therefore give scores with small variations. The other reason may be the lack of diversity in the data set used in the human evaluation. The automatic evaluation scores are highly correlated with one another, as they all measure overlap (see Table 4). However, the human and automatic evaluation metrics also have a strong positive correlation, suggesting a good level of agreement with human judgement. The human Error score correlates slightly more strongly with all four automated metrics than the Coherence score does, which may be because the latter is more subjective. Similar results were obtained when using Spearman’s rank correlation. From this, it is concluded that human and automatic evaluation measures correlate quite strongly, at least for the domains that we have studied.

Table 4. Correlation between measures.
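The Pearson and Spearman correlations reported in Tables 4 and 5 can be obtained with scipy, as in this small sketch over two lists of per-algorithm mean scores.

# Sketch of the correlation analysis between two evaluation measures.
from scipy.stats import pearsonr, spearmanr

def correlations(scores_x, scores_y):
    pearson_r, pearson_p = pearsonr(scores_x, scores_y)
    spearman_rho, spearman_p = spearmanr(scores_x, scores_y)
    return {"pearson": (pearson_r, pearson_p), "spearman": (spearman_rho, spearman_p)}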

We observe that the automatic and human evaluation scores have high correlations, but statistically different means. A two-tailed paired t-test showed that there is a significant difference in the mean of the automatic summarisation metrics (\(M = 0.224\), \(SD = 0.090\)) compared to the human summarisation metrics (\(M = 0.781\), \(SD = 0.038\)), \(t(11) = -27.126\), \(p = 0.000\). On average, the human metrics result in higher scores than the automatic evaluation. The standard deviations of the measures tend to be similar. Taking this into account, in addition to the correlation results, it appears that the automatic metrics track human judgement relatively accurately.

6.2 Domain Dependence

One of the goals of this work is to establish if summarisation algorithms perform differently depending on the topic of the text. To do this, the algorithms were also run on political articles from the CNN/DM data set, and then evaluated using the automatic evaluation metrics. In general, it seems that the algorithms performed worse on the politics articles compared to sports. The gap is larger with abstractive algorithms. A two-tailed paired t-test showed that there is a significant difference in the performance of the summarisation algorithms on sports articles (\(M = 0.224\), \(SD = 0.088\)) compared to politics articles (\(M = 0.189\), \(SD = 0.072\)), \(t(5) = 4.359\), \(p = 0.007\).

The correlation between each of the six summarisation algorithms’ mean evaluation score for politics articles and sports articles was calculated. All six algorithms were highly correlated according to both Pearson’s and Spearman’s measures as seen in Table 5. This leads to the conclusion that a similar performance pattern for the summarisation algorithms exists in both domains.

The results discussed here lead to the conclusion that the ranking of the algorithms does not depend on the domain, as the best and worst performing algorithms were the same for sports and politics articles. However, there is a significant difference in the actual performance of each algorithm. In general, the algorithms performed considerably better on sports articles according to the evaluation measures used. The abstractive algorithms that were trained on the CNN/DM data set showed the biggest gap in performance between the two domains. This could be due to more sports than politics articles being available for training.

Table 5. Correlation between sports and politics articles.

7 Conclusions and Future Work

In the experiments conducted in this paper, the two abstractive algorithms performed better than the two extractive ones, not only with regard to coherence, as expected, but also in terms of the automatic evaluation metrics. This was the case in both the sports and politics news domains. It is worth noting that, even if the relative ranking of the algorithms is domain-independent, their performance is domain-dependent. That said, the performance of the abstractive algorithms depends on the data used in their pre-training. The results of the evaluation metrics in this paper agree with the results from [5]. The only exception is that, when the abstractive algorithms are pre-trained using an inappropriate training set, they perform more poorly than their extractive counterparts.

The incorporation of human evaluation allowed us to further understand the effectiveness of the algorithms considered. It was concluded that, due to the high correlation between human evaluation metrics and automatic ones, automatic metrics such as ROUGE and BLEU can provide a good estimation of human judgement. It was also found that the algorithms performed similarly on the CNN and DM article sets. However, human evaluation scores were slightly higher on DM articles, possibly due to a user bias in relation to British English. The web application developed as part of this work acts as the first user interface that allows the comparison of sample summaries generated by different summarisation algorithms across different subject domains. The application also enables human evaluation experiments to be conducted, providing a useful tool for researchers working in this area.

This study would benefit from expanding the number of algorithms evaluated, including a larger set of extractive and abstractive techniques. The human evaluation could be improved by adding more participants. In addition, a larger pool of articles would make the results more robust. More detailed aspects could be added to the human evaluation, for example by collecting the syntactic types and the severity level of the errors appearing in summaries. This paper was an introduction to testing the generalisation of summarisation algorithms. This could be expanded by looking at the performance of the algorithms in more domains, such as travel articles, or by performing within-domain comparisons, such as sports news articles covering football versus basketball.