1 Introduction

In our highly digitized world, an enormous amount of information in the form of text is collected on a daily basis. In particular, user-generated data are growing rapidly on the Internet through user reviews, news, blogs, social networks, idea contests, etc. This growth in text data has taken on exponential proportions, with large amounts of redundant information being collected [19].

Due to this overload of data, it is becoming increasingly difficult to pick out relevant information and obtain an overview of a topic. Reading and understanding all the available content takes considerable time and cognitive effort and is becoming increasingly infeasible. People are faced with a lot of irrelevant, noisy, and redundant content, so summarizing textual data is becoming more and more important [5] as a way to create a short summary of a single document or a set of documents while preserving the sense of the content. Although automatic text summarization has always been an important application of natural language processing, it is now becoming more relevant due to the improved effectiveness of pre-trained deep natural language models [16].

Despite these advances, text summarization generally focuses on creating a single generic summary that covers the most common points of one or multiple documents, thereby emphasizing the greatest similarities rather than the greatest differences [17]. In practice, however, user-generated data produce many redundant ideas, solutions, and divergent opinions at large scale, which motivates identifying the most important differences as an alternative or complement to common text summarization approaches. This is especially relevant when the content is diverse and varied, making it difficult for humans to effectively process and understand all the distinct information available. By automating the process of comparing and summarizing text, one can potentially extract relevant information more efficiently and gain diverse insights from large amounts of data. For example, user comments on two different accommodations in one tourist destination can be comparatively summarized [25], and the differing positions of several political parties can be summarized comparatively [6, 44]. It is therefore important not only to detect the key similarities but also to identify the differences and reduce redundancies in text summaries.

This particular task of natural language processing can be found in the literature under the term contrastive summarization or comparative summarization [23, 25, 34, 48]. Unlike generic summarization, query-oriented summarization, and update summarization [51], contrastive summarization addresses the problem of generating a summary that highlights the differences between multiple texts.

In this paper, we provide a research overview of the topic of contrastive/comparative summarization, for which, despite its high importance and many promising applications, no standardized evaluation benchmark has yet been developed [16].

Furthermore, our paper is intended to highlight possibilities and promising research avenues that can still be explored and considered. In particular, our review is structured as follows. Section 2 defines and explains the problem and describes, in detail, the systematic literature search conducted in this paper. Section 3 presents the main data sets and evaluation criteria used for contrastive summarization. Section 4 presents and compares the given approaches, methods, and techniques. All presented papers are then classified according to the methods used, while trends and future research directions are also identified. Section 5 introduces the most important applications of contrastive summarization. We will show that the main focus of contrastive summarization is usually to find different opinions on a certain topic, different reviews of a product, or differences in newspapers. Section 6 concludes our research article by highlighting our findings and further research opportunities.

2 Problem definition and literature selection

Contrastive summarization was first mentioned in [34] as the joint creation of summaries for two entities to highlight their differences. In their study, product reviews of 56 electronic products with about 70 ratings per product were used to collectively create two summaries that highlight the differences between two products.

Three years later, [56] narrowed the definition of comparative summarization to an extractive approach (comparative extractive document summarization), where the task is to summarize the differences between comparable groups of documents. In contrast to [34], [56] comparatively summarize more than two groups of documents, using a method that extracts the most discriminative sentences for each group.

More recently, [25] framed this problem as comparative opinion summarization. In this work, the authors aim to produce two contrastive summaries and one common summary from two different candidate sets of reviews; for example, several reviews of one hotel are compared with reviews of another hotel. The two contrastive summaries stem from the differences between the respective reviews, while the common summary represents their intersection. Liu et al. [40], on the other hand, and in contrast to [25, 34], define a one-to-many comparative summarization task in which one document is compared to many others.

Our understanding is that contrastive or comparative summarization is the attempt to summarize text from different documents related to the same theme in such a way that the relevant differences become highlighted. This results in summaries that either describe the differences between different versions of the same document or the differences between distinct documents that discuss the same subject or are comparable in a certain way (e.g., the differences between two political party programmes).

With this definition in place, we conducted a systematic literature search that uses the following logical expression to obtain literature relevant to the discussed theme (Footnote 1):

$$\begin{aligned} \begin{array}{c} (\text {comparative} \vee \text {contrastive} \vee \text {compare} \vee \text {contrast} \vee \text {contrasting} \vee \text {comparing}) \\ \wedge \\ (\text {summarization} \vee \text {summarisation} \vee \text {summarizing} \vee \text {summarising} \vee \text {summarize} \vee \text {summarise} \vee \text {summaries}) \end{array} \end{aligned}$$

To perform the search, we used the dblp computer science bibliography primarily as a search basis and verified it with the ACL Anthology and Google Scholar. The different combinations of search queries yielded 138 publications as of April 12, 2023. We then conducted a curation process that involved reading all titles and abstracts. By doing this, we found that some publications only include a comparative study or a comparative evaluation or are solely about contrastive learning. Publications on the related topic of timeline summarization were also found, along with some non-English publications. The removal of these publications resulted in a final set of 26 relevant publications, all listed in Table 1. The number of citations was determined using Google Scholar (Footnote 2), and the Method column is intended to provide a brief overview of the approach used.
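
Purely as an illustration of this query logic (the actual searches were run against dblp, the ACL Anthology, and Google Scholar, not with this code), a title can be checked against the two term groups as follows:

```python
import re

# The two term groups of the boolean search expression above.
COMPARE_TERMS = ("comparative", "contrastive", "compare", "contrast", "contrasting", "comparing")
SUMMARY_TERMS = ("summarization", "summarisation", "summarizing",
                 "summarising", "summarize", "summarise", "summaries")

def matches_query(title: str) -> bool:
    """True if the title contains at least one term from each group."""
    words = re.findall(r"[a-z]+", title.lower())
    return any(t in words for t in COMPARE_TERMS) and any(t in words for t in SUMMARY_TERMS)

# Hypothetical example titles, only to show the filter behavior.
titles = [
    "Contrastive Summarization of Product Reviews",
    "A Comparative Study of Neural Parsers",
]
print([t for t in titles if matches_query(t)])  # keeps only the first title
```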

Table 1 shows that, in addition to [34], who were the first to address the problem, [30, 44, 56] are the most cited publications, each with more than 50 citations. Along with this, we also show the method used in each research paper. A previous survey of this area was conducted by [32]; however, the authors focused only on opinion summarization and covered only a small subset of 6 publications [20, 30, 34, 43, 44, 53]. Our work extends this survey by including a larger number of publications on the topic of contrastive/comparative summarization and also covers publications that appeared after 2019.

Before delving into the details of each research work, we first present the data sets and evaluation metrics commonly used in the community.

Table 1 Selected literature

3 Data sets and evaluation metrics

Although several research works have been proposed over the years, the lack of standard data sets and/or a competition task dedicated to this topic has limited the comparison of the different contrastive summarization methods proposed so far and makes it difficult to thoroughly understand their strengths and weaknesses, ultimately impeding the emergence of new methods. Despite this, a few data sets have been used and/or adapted (see Table 2). However, none has become established as a reference data set.

The efforts of the research community have also led to the emergence of different metrics used to evaluate the proposed methods. The Overall Responsiveness scale, a manual human-based Likert-scale rating, is one such metric: raters assign an overall score to each summary based, for example, on content as well as readability [24].

Another metric is the Comparative Aspect Recall scale, which assesses the effectiveness of comparative extraction and is defined as the number of human-agreed comparative aspects in the summary [24].

Aspect Coverage, on the other hand, serves as an alternative: it measures the number of unique aspects contained in the summary divided by the number of unique aspects labeled by human annotators [30].
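
Written out formally (our notation; [30] describe the measure in words), this is:

$$\begin{aligned} \text {Aspect Coverage} = \frac{\#\,\text {unique aspects in the generated summary}}{\#\,\text {unique aspects labeled by human annotators}}. \end{aligned}$$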

Also, the precision and recall metrics can be adapted to the contrastive realm by counting the number of sentences in the automatically generated contrastive summary and the manually generated one. The F-Measure combines both measures. A formal representation of each of the metrics is given below. Let a and m be the number of sentences in the automatically generated contrastive summary and the manually written contrastive summary, respectively; let c be the number of human-agreed correct comparative sentences in the automatically generated summary, then the precision, recall, and F-measure are given as follows [23]:

$$\begin{aligned} \begin{aligned}&\text {Precision} = \frac{c}{a},\\&\text {Recall} = \frac{c}{m},\\&\text {F} = \frac{2\,\text {Precision}\,\text {Recall}}{\text {Precision}+\text {Recall}}. \end{aligned} \end{aligned}$$

Higher precision means that more of the sentences retrieved in the automatically generated contrastive summary are correct contrastive sentences, and higher recall means that more of the sentences manually labeled as relevant in the reference summaries are retrieved in the automatic summary. The F-measure is the harmonic mean of both values and is largest when they are balanced.
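
As a minimal illustrative sketch (not code from any of the surveyed papers), these values can be computed directly from the counts defined above:

```python
def contrastive_prf(a: int, m: int, c: int) -> tuple[float, float, float]:
    """Precision, recall, and F-measure for an extractive contrastive summary.

    a: sentences in the automatically generated contrastive summary
    m: sentences in the manually written contrastive summary
    c: human-agreed correct comparative sentences in the automatic summary
    """
    precision = c / a if a else 0.0
    recall = c / m if m else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

print(contrastive_prf(a=10, m=8, c=4))  # (0.4, 0.5, 0.444...)
```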

Another standardized procedure for evaluating automatically generated contrastive summaries is the ROUGE metric toolkit. All ROUGE metrics count the number of units overlapping between the candidate summaries and the reference summaries [38] (a minimal sketch of the n-gram variant is given after the list):

  • ROUGE-N measures the overlap of n-grams between a candidate summary and a reference summary; for example, ROUGE-1 measures the overlap of unigrams and ROUGE-2 the overlap of bigrams.

  • ROUGE-L measures the longest common subsequence between a candidate summary and a reference summary.

  • ROUGE-W is an improvement of ROUGE-L that weights the measure of longest common subsequence with word order.

  • ROUGE-SU is a combination of Skip-bigram and unigram-based measure.
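
As an illustration of the n-gram variant, the following sketch computes a simple ROUGE-N recall; real toolkits additionally report precision and F-scores and handle stemming, stop-word removal, and multiple references:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the hotel room was small", "the room was very small", n=1))  # 0.8
```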

The main disadvantage of the ROUGE metrics, as well as of precision, recall, and F-measure, is that they are better suited to extractive contrastive summarization, since synonyms are not taken into account [42]. Summaries that are semantically very similar but share few words will therefore receive a low score. With abstractive summarization, the same content need not be described with the same word phrases, i.e., a summary may have little word overlap with the reference summary even though the content is captured well. Thus, in contrast to standardized scoring metrics that rely on word overlap, human-based measures such as Overall Responsiveness can be used for abstractive summaries.

Another development, so far only used by [25] in the contrastive summarization literature, is the use of semantic similarity computed with pre-trained language models to evaluate generated contrastive summaries [41]. Iso et al. [25] compute semantic distances using a pre-trained BERT model.
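
For illustration, such an embedding-based comparison can be sketched with the sentence-transformers library; this is a generic setup, not the exact evaluation protocol of [25], and the model name is only an example:

```python
from sentence_transformers import SentenceTransformer, util

# Any pre-trained sentence encoder works; this model name is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "The staff was friendly but the rooms were noisy at night."
reference = "Guests praise the friendly staff while complaining about nighttime noise."

emb_gen, emb_ref = model.encode([generated, reference], convert_to_tensor=True)
print(float(util.cos_sim(emb_gen, emb_ref)))  # semantic similarity instead of word overlap
```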

In the following, we describe in detail each of the methods used by different research works considered in this study to perform a contrastive summarization.

4 Contrastive summarization methods

As with automatic text summarization, contrastive summarization can be divided into the following approaches: extractive, abstractive, or hybrid. In the extractive approach, the important sentences are selected from the input text and used to create the summary. In the abstractive approach, a summary is created using words and phrases that can differ from the original text sentences. Finally, the hybrid approach combines extractive and abstractive methods. In the context of contrastive summarization, all 26 publications studied in this survey, except [25], follow an extractive approach. Extractive text summarization approaches can be further subdivided according to the methods they use, and this subdivision will guide us through the rest of this section.

We distinguish optimization-based, topic-based, statistical, machine learning-based, neural network-based, and graph-based methods. Most publications use a combination of these methods to achieve the best possible result, as we now present in more detail.

Table 2 Data sets, metrics and applications

4.1 Optimization problems

The extractive summarization task can be defined as an optimization problem, where sentences are transformed into vectors with many statistical or linguistic features and are selected through multiobjective optimization functions.
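
To convey the general flavor of these formulations, the following sketch greedily selects sentences that are representative of their own document group yet dissimilar to a comparison group; it is a simplified illustration over tf-idf vectors, not the exact objective of any of the works discussed below:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def contrastive_extract(sents_a, sents_b, k=2, alpha=0.5):
    """Pick k sentences from sents_a that are representative of group A
    yet dissimilar to group B (alpha trades off the two criteria)."""
    vec = TfidfVectorizer().fit(sents_a + sents_b)
    A, B = vec.transform(sents_a), vec.transform(sents_b)
    centroid_a = np.asarray(A.mean(axis=0))
    repr_score = cosine_similarity(A, centroid_a).ravel()        # representativeness of A
    contrast_score = 1.0 - cosine_similarity(A, B).max(axis=1)   # dissimilarity to B
    score = alpha * repr_score + (1 - alpha) * contrast_score
    return [sents_a[i] for i in np.argsort(-score)[:k]]
```

In the surveyed papers, the selection is typically posed as a constrained (often integer linear or combinatorial) program rather than a greedy heuristic, with additional terms for diversity and a summary length budget.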

The work of Lerman and McDonald [34] is based on an optimization problem called the Sentiment Aspect Match model, which attempts to summarize contrastiveness based on the sentiment scores of different user reviews of electronic products under a pre-specified length constraint. To evaluate the contrastive summaries, the authors asked 55 online raters to find differences between two contrasting summaries and to judge the usefulness of these differences, giving three ratings for each summary on a four-star Likert scale. The results show that about 80% of the raters were able to find at least two points of contrast in the summaries generated by the contrastive approach, compared to 40% for the summaries generated by the plain Sentiment Aspect Match model, meaning that contrastive summarization clearly outperforms single-product summarization.

In addition, [52] use sentiment scores in an optimization framework that highlights differences between entities in opinionated text and satisfies the following three characteristics: representativeness (presence of opinions that are common in the input), contrastiveness (presence of opinions that highlight differences between entities), and diversity (presence of different opinions to avoid redundancy). A purpose-built score (from 0 to 100) for representativeness, contrastiveness, and diversity is used to validate their contrastive summaries of consumer reviews. Their method achieves an overall score (harmonic mean of all three measures over all data sets) of 82, compared with 59 for the approaches of [30, 34].

Instead of sentiment scores, [30] use word overlap and semantic word matches to define content similarity and contrastive similarity within their optimization problem. They use existing product review summaries from [22] and two human reviewers to identify representative pairs of contrastive sentences from these data. The results of their experiments show that the proposed methods are effective in producing contrastive opinion summaries, with the contrastivity-first approach (0.54 precision, 0.80 aspect coverage) performing better than the representativeness-first approach (0.50 precision, 0.74 aspect coverage).

In [23], a linear optimization problem based on term frequency and inverse document frequency is used, which balances the comparativeness within the summary against the representativeness of the topics. Individual sentences whose vocabulary differs most from all sentences in the other documents are selected, without considering synonyms. The authors collected their own data set consisting of articles from different newspapers on comparable topics and used precision, recall, F-measure, and the ROUGE toolkit to validate their comparative summarization approach. In summary, their linear programming-based comparative model (0.36 precision, 0.42 recall, 0.39 F-measure, 0.43 ROUGE-1), which focuses on comparability and representativeness at the same time, achieves better performance in both comparison extraction and summarization than their self-defined non-comparative model (0.24 precision, 0.26 recall, 0.25 F-measure, 0.40 ROUGE-1) and the co-ranking model (0.31 precision, 0.29 recall, 0.29 F-measure, 0.43 ROUGE-1).

Similarly, [56] formulate a combinatorial optimization problem based on a document-sentence representation: sentences are considered as features, and the challenge of selecting discriminative sentences is formulated as a sentence-based feature selection problem. The combinatorial optimization problem is addressed with a multivariate normal generative model and a sequential selection method developed by the authors. Unlike [23, 30, 34], the features are no longer based on single words but on entire sentences, taking into account sentence–document and sentence–sentence relationships. The authors used blog entries without comments and abstracts from computer science research papers as data sets to create groups from the data and identify differences between them. The ROUGE metric is used to validate their three types of contrastive summaries against human-generated summaries. Their approach yielded a ROUGE-1 of 0.53 on the research paper data set, compared to seven other approaches that fall within a ROUGE-1 range of 0.21–0.32. The experiments demonstrate the effectiveness of the proposed method, which benefits from the document-sentence representation.

In the work of [53], a vector of features with bag-of-words and tf-idf scores is used to formulate an optimization problem that generates short and comparative summaries based on product reviews. The authors manually labeled their data for supervised learning and validation. In summary, their approach selects pairs of review snippets in such a way as to produce a summary comparison of the product that outperforms a naive self-defined baseline solution by more than 20%.

Huang et al. [24] employ a concept-based optimization approach, which uses cross-topic pairs of semantically related concepts as evidence of comparability and topic-related concepts as evidence of representativeness. Sentence-level tf-idf similarities and sparse one-hot encodings are used in the optimization. The authors use the same data set as [23] and rate each summary on a five-star Likert scale for both content and readability/fluency. In addition, their contrastive summarization approach is validated using the ROUGE metric. The authors show that comparative analysis of related news topics is useful in many applications and that, using linear programming, their model (3.5 Comparative Aspect Recall, 3.4 Overall Responsiveness, 0.27 ROUGE-2 on the English data set) outperforms the self-defined non-comparative model (1.9 Comparative Aspect Recall, 2.6 Overall Responsiveness, 0.22 ROUGE-2) and the co-ranking model (2.3 Comparative Aspect Recall, 2.9 Overall Responsiveness, 0.22 ROUGE-2) in comparative extraction and summarization.

Finally, in [43] a max-sum optimization problem is formulated. Word-level differences are determined by cosine similarity with tf-idf and are aggregated over whole sentences and documents. This approach attempts to determine a list of pairs of the most representative sentences related to a given aspect, where each pair contains a positive and a negative sentence with contrasting meanings, e.g., “The design is really well done.” as a positive sentence and “But my biggest criticism is still the extremely ugly design.” as a negative sentence. The same English data set as [30] and a self-created Turkish data set of user reviews are used for validation and comparison with [30]. Their approach obtained 0.65 precision and 0.89 aspect coverage using term frequency and \(\lambda = 0.80\), versus 0.50 precision and 0.74 aspect coverage for [30]'s representativeness-first approach and 0.54 precision and 0.80 aspect coverage for [30]'s contrastivity-first approach. They obtained better results and observed that using cosine similarity with term frequency performed better than using tf-idf in their max-sum optimization.

4.2 Topic models

The use of topic models is also widely adopted by the research community. In general, works that implement such techniques aim to identify the topics of a document more precisely before selecting the desired sentences based on these topics.
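
As a rough illustration of this idea (using sklearn's LDA implementation as a stand-in; none of the surveyed works uses exactly this code), one can fit a topic model over two document groups and inspect which topics separate them:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy documents standing in for two groups to be contrasted.
group_a = ["the battery lasts two days", "battery life is excellent"]
group_b = ["the screen cracked after a week", "display quality is poor"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(group_a + group_b)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                                  # document-topic distributions
mean_a, mean_b = theta[:2].mean(axis=0), theta[2:].mean(axis=0)
print(mean_a - mean_b)   # topics with a large absolute difference separate the two groups
```

The surveyed methods go further by modeling aspects, sentiment, or viewpoints jointly with the topics and by selecting sentences conditioned on the contrasting topics.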

One of the most popular methods in this regard is the Topic-Aspect Model, an extension of Latent Dirichlet Allocation (LDA). It is a Bayesian mixed model that jointly discovers topics and aspects. In [44], the Topic-Aspect Model is used as a first step to build multiple topics and extract viewpoints, in combination with Comparative LexRank, an adaptation of the PageRank algorithm, to contrast these viewpoints. The Gallup® telephone survey on the 2010 US health care bill [29] and the Bitterlemons corpus [39] are used as pre-existing sources in their contrastive summarization approach. The ROUGE metric is used to validate the contrastive summaries, and the approach is compared to a standard LexRank summarization and to the approach of [34]. Their approach yielded 0.43 ROUGE-1 without stop words, compared to 0.36 for standard LexRank and 0.35 for the approach of [34]. The results show that their method outperforms both comparison approaches and can thus produce more informative summaries of viewpoints in opinionated texts.

In addition to the Topic-Aspect Model, [37] also use various graph-based centrality scoring approaches, namely basic LexRank, Comparative LexRank, Topic-sensitive tf-idf LexRank, Topic-sensitive tf-idf & Comparative LexRank, and Biased & Comparative LexRank, to rank the centrality of each sentence. The sentences with the highest centrality values for each topic are selected to form the contrastive summaries. Like [44], they use the same Gallup® telephone survey [29] to build contrasting viewpoints and summaries and use the ROUGE metric for validation. Empirical experiments show that all proposed methods have similar ROUGE-1 precision values (ranging from 0.08 to 0.11) and can effectively perform the summarization task; in particular, the proposed Topic-sensitive tf-idf & Comparative LexRank method can be used for both multi-topic summarization and contrastive viewpoint summarization with high performance.

In [7, 8], Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are used to determine the different topics of the documents. These topics are then compared, and the main differences are used to select the most distinctive sentences for the contrastive summaries. A specially created data set of news articles from the TAC 2011 conference is used in combination with the ROUGE metric for evaluation. The average scores for both algorithms, LSA and LDA, show comparable results (ROUGE-1 recall and precision just over 0.35 for LDA; ROUGE-1 recall just under 0.35 and precision over 0.4 for LSA), with LDA providing better recall but LSA better precision.

Ren and de Rijke [49] present a three-step approach: a hierarchical sentiment Latent Dirichlet Allocation is used to model contrastive topics, which are filtered by a structured determinantal point process down to the most diverse topics and then used in an iterative optimization algorithm that selects sentences with explicit consideration of contrast, relevance, and diversity to form the contrastive summary. The approach is compared with several other topic models, namely the Topic-Aspect Model [44], the Sentiment-Topic Model [35], Latent Dirichlet Allocation [3], and hierarchical Latent Dirichlet Allocation, in combination with summarization approaches such as LexRank and clustering-based sentence ranking. Like [44], they use the same Gallup® telephone survey [29] and add news articles extracted from the New York Times to build contrastive summaries across topics. The ROUGE metric is used to validate their three-step approach; the contrastive summaries produced meet the three main criteria of contrast, diversity, and relevance, and significant improvements on the three manually annotated data sets demonstrate the effectiveness of the proposed method. Their approach yields a ROUGE-1 of 0.4 on the Healthcare corpus, compared to 0.4 for the Topic-Aspect Model and 0.31 for the Sentiment-Topic Model.

A semi-supervised Probabilistic Latent Semantic Analysis model is used in combination with a sentence selection strategy based on a contrastive similarity measure that indicates how well two sentences with opposing opinions match [20]. Taking prior information from experts into account, the topic model groups the arguments. Guo et al. [20] created their own data set on controversial topics and used precision and coverage measures to evaluate their contrastive summarization approach and compare it to those of [30, 43]. Their model (precision of 0.6 and coverage of 0.67) outperforms the approach of [30] (precision of 0.2 and coverage of 0.17) and that of [43] (precision of 0.2 and coverage of 0.33) in terms of precision and coverage.

The focus of [21] is a differential topic model, dTM-SAGE, combined with a sentence scoring method that measures the discriminative power of sentences to summarize the differences between groups of documents. The topic model captures deviations of group-specific word distributions from a background word distribution, indicating how words are used differently across document groups. The authors used 35 papers with 6636 sentences from the ACL Anthology Searchbench for their contrastive summarization approach, which is validated with the ROUGE metric. In their evaluation, their approach (ROUGE-1 0.42) significantly outperforms generic baseline summarization approaches such as the centroid-based method [47] (ROUGE-1 0.23), the graph-based method LexPageRank [47] (ROUGE-1 0.25), and the MMR-based method [9] (ROUGE-1 0.28), as well as two contrastive summarization approaches, [56] (ROUGE-1 0.31) and [51] (ROUGE-1 0.32).

Lavanya and Parvathavarthini [31] use a context-sensitive PLSA model with initial linguistic rules based on dependency relations to extract context-feature-opinion phrases and then automatically cluster the extracted phrases into contrastive summaries. In their work, the same product review data set as in [30] and the benchmark car data set [18] are used. The ROUGE metric is used to compare their approach (ROUGE-1 0.33) with that of [20] (ROUGE-1 0.31), which achieved similar results.

4.3 Statistical approaches

Purely statistical approaches compute statistical and/or linguistic characteristics and their weights for sentences or words, and then, select the most important words or phrases based on these characteristics. In these scoring algorithms, additional attempts are made to explicitly extract orthogonal sentences to represent the most discussed items.

References [11, 48] both use a statistical sentence-weight calculation to compute a comparative summary. [48] use Kullback–Leibler divergence and a bag-of-words model to quickly isolate interesting opinions and to provide analysts with feedback on how users generally feel about a given topic. They apply their summarization approach to the content and comments of news stories but do not provide a detailed description of the data set, any indication of validation of the method, or comparisons with other methods.
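
A bare-bones version of such a KL-based weighting, our own simplification rather than the exact scoring of [48], compares the word distribution of a target document group against a background distribution and scores a sentence by the divergence contributed by its words:

```python
import math
from collections import Counter

def word_dist(texts, vocab, eps=1e-6):
    """Smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] + eps for w in vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_score(sentence, target_texts, background_texts):
    """Sum of per-word KL contributions p(w) * log(p(w)/q(w)) for the sentence's words."""
    vocab = set(w for t in target_texts + background_texts for w in t.lower().split())
    p = word_dist(target_texts, vocab)
    q = word_dist(background_texts, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in set(sentence.lower().split()) if w in vocab)

# Toy usage: words distinctive of the target group yield a high score.
target = ["the camera is amazing", "amazing zoom and camera quality"]
background = ["the delivery was slow", "the price is fair"]
print(kl_score("amazing camera quality", target, background))
```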

Chitra et al. [11], on the other hand, generate comparative summaries from a set of URLs using the HTML DOM tree structure of these web pages and using feature keywords to score sentences. A collection of 200 web documents related to educational institutions, algorithms, banking, and household items is collected to contrastively summarize these web pages. They use a five-star Likert scale to measure the overall responsiveness of their contrastive website summaries and show that their system reduces the time and effort required for the user to browse different websites to compare information.

4.4 Machine learning approaches

Machine learning and neural networks have also found their way into contrastive text summarization.

For example, [27] use a sparse predictive classification approach that automatically labels text units for a given topic, pre-processes candidate summarizing phrases and phrase counts, and sparsely selects a contrastive phrase list of interest using Lasso and \(L^1\)-penalized logistic regression on the automatic labels. Each contrastive summarization approach is applied to a set of New York Times articles. The authors compared their four feature selection methods in a crossed and randomized experiment in which non-experts read both the original documents and their summaries and rated the quality and relevance of the results on a five-star Likert scale. Based on this human experiment, they concluded that features selected with a sparse prediction framework, such as Lasso or tf-idf, can produce informative keyword summaries for topics of interest.
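
The core idea can be sketched as follows (an L1-penalized logistic regression in sklearn over toy documents with hypothetical labels; [27] derive their labels automatically and compare several feature selection variants):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical documents and labels; 1 = topic of interest, 0 = background.
docs = ["toy on-topic article about the healthcare bill",
        "toy off-topic article about a football match",
        "another healthcare reform article",
        "another sports report"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# L1 penalty drives most coefficients to zero; C is chosen loosely for this toy data.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=5.0).fit(X, labels)

weights = clf.coef_.ravel()
keywords = [w for w, c in zip(vec.get_feature_names_out(), weights) if c > 0]
print(keywords)   # the sparse set of words that characterize the topic of interest
```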

On the other hand, [1, 2] define extractive contrastive summarization as a binary classification problem using the maximum mean discrepancy in combination with a gradient optimization method. For given groups, the algorithm learns to select sentences that represent each group but also highlight differences between groups. References [1] and [2] use news articles on controversial topics to comparatively summarize articles from different time periods and determine what changed about the topic between summarization periods. In a human evaluation setting, crowd workers are given some contrastive summary articles from two groups and asked to classify them into one of the two groups; the comparative summarization is then evaluated with the resulting accuracy value. They found that the gradient optimization summaries were 7% more accurate in classification than those from discrete optimization. References [1] and [2] use not only machine learning approaches but also pre-trained GloVe vectors to represent documents in their comparative summarization via the binary classification method. In recent years, there has been a trend towards using or fine-tuning pre-trained language models, which we present in the next section.

4.5 Deep learning and transformer language models

A breakthrough of neural networks in text summarization was achieved with the invention of word embeddings and Transformer language models. Word embeddings, such as those introduced in [45] and [12], are representations of words and phrases mapped to dense numerical vectors that capture their semantic meaning and context. The Transformer, a neural network architecture introduced in 2017 [54], has led to significant improvements in tasks such as machine translation, language modeling, and summarization. Many well-known language models such as BERT [12] and GPT-3 [4] are based on this architecture.

Lavanya and Parvathavarthini [33] train a long short-term memory network on pre-trained Word2Vec embeddings with attention mechanisms (feature attention, opinion attention, and context attention) to automatically build context-sensitive contrastive summaries. Context-sensitive sentiment classification using a softmax classifier is used to identify and present contrastive summaries from a given set of positive and negative summaries of two entities. They used the SemEval 2014 Task 4 restaurant review data set [46] to train their context-sensitive contrastive opinion summarization model. The precision measure and the ROUGE metric are used to validate their approach, and the experimental results show that the proposed model achieves better or similar performance (ROUGE-1 0.36) compared to baseline models such as [58] (ROUGE-1 0.36) and [50] (ROUGE-1 0.32).

A Word2Vec model is trained from scratch on a large patent data set to determine the vector representations of [40]’s patent vocabulary, and [10] use a pre-trained BERT model to represent the scientific vocabulary. Both approaches are described in more detail in the graph-based subsection.

None of the above methods, however, uses language models for abstractive summarization, only for extractive summarization. In contrast, the COCOSUM comparative summarization system [25] is a fine-tuned language model that produces abstractive contrastive summaries. This work clearly distinguishes itself from the others, as it highlights not only what is different but also what is common, and it uses an abstractive approach. Iso et al. [25] created a comparative opinion summarization corpus containing human-written contrastive and common summaries of hotel reviews. The ROUGE metric is used to validate the contrastive summaries of hotel reviews, and the experimental results on their benchmark data set show that COCOSUM can produce higher-quality contrastive and common summaries than two extractive and four abstractive opinion summarization models. All variants of their method have a ROUGE-1 of more than 0.4, versus a range of 0.32–0.37 for the baseline methods, and a BERTScore of more than 0.29, versus a range of 0.2–0.24.
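
For orientation, an off-the-shelf abstractive summarizer can be invoked as shown below; this is a generic Hugging Face pipeline with an example model name, not COCOSUM, which is a dedicated fine-tuned model that jointly produces contrastive and common summaries:

```python
from transformers import pipeline

# Generic abstractive summarizer for illustration only.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

reviews_a = ("Great location close to the beach, but the rooms are small "
             "and the street noise keeps you up at night.")
print(summarizer(reviews_a, max_length=40, min_length=10)[0]["summary_text"])
```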

4.6 Graph-based methods

In graph-based models, documents are represented as graphs over their sentences. After a long period of limited attention [37, 44], such models have recently experienced something of a renaissance with pre-trained language models. Two works stand out here: [10, 40] both developed graph-based comparative summarization methods using pre-trained language models as a basis.
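
The basic graph-based recipe can be sketched as follows (a plain tf-idf similarity graph ranked with PageRank via networkx; the methods of [10, 40] build richer, link-aware and embedding-based graphs):

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, threshold=0.1):
    """Build a sentence similarity graph and rank sentences by PageRank centrality."""
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:
                graph.add_edge(i, j, weight=float(sim[i, j]))
    scores = nx.pagerank(graph, weight="weight")
    ranking = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranking]
```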

Liu et al. [40] use the vector representations of the patent vocabulary given by the Word2Vec model as features in a multi-relational graph to generate contrastive summaries. They created a data set of real-world linked patents to create contrastive patent summaries. Precision, recall, and F-measure are used to validate their approach (results for the mechanical engineering data set: 0.68 precision versus a range of 0.10–0.23 for baseline approaches, 0.59 recall versus 0.44–1.00, 0.63 F-measure versus 0.16–0.35), with experimental results and the detailed analysis in the case study showing the effectiveness of the proposed framework.

Similarly, [10] use a pre-trained BERT model to generate comparative summaries using a comparative graph-based summarization method that uses citations as a guide. The ROUGE metric is used for the evaluation on a specially created corpus of scientific articles with linked citations. Experiments show that their method (ROUGE-1 0.49) outperforms nine single and multi-document summary methods (ROUGE-1 ranges from 0.23 to 0.43) in their own corpus and also performs well in DUC2006 (ROUGE-1 of 0.40 against a range of 0.32 to 0.41) and DUC2007 (ROUGE-1 of 0.41 against a range of 0.33 to 0.43).

In addition to plain text, both approaches [10, 40] use linking data, such as those found in patents or scientific publications, to connect and contrast content. The next section describes the applications used in contrastive summarization approaches.

5 Applications of contrastive summarization

Contrastive summarization is used in many areas where different and controversial contributions to a particular topic or problem are collected and need to be distinguished. This motivated us to investigate which applications are considered and targeted in contrastive summarization, and thereby to inspire future applications.

Overall, we found that contrastive summarization is used for opinions or sentiments on an issue, reviews, content in blogs or tweets, websites, patents, news articles, and even scientific research articles. References [1, 2, 7, 8, 23, 24, 27, 48, 49] used their approaches to analyze newspaper articles on controversial topics. In addition to newspaper articles, [20, 25, 30, 31, 33, 34, 37, 43, 44, 48, 52, 53] used user-generated data as an application of their methods, attempting to contrast and summarize opinions, comments, arguments, and reviews on controversial topics or products. References [20, 56] crawl their user-generated data from blogs and Twitter to contrast and summarize controversies and diverse arguments.

Contrastive summarization can also make use of more complex data structures, such as those found in patents, web pages, or scientific articles. Chitra et al. [11] use the HTML DOM tree structure to summarize multiple web pages comparatively. Liu et al. [40] utilize the underlying network resulting from the link data structure of patents, and [10] use the citation structure of scientific articles in addition to the link structure of patents. Without using the citation link structure, [21] also uses scientific articles as an application of their approach.

6 Conclusion and outlook for further research

Our survey shows that the problem of contrastive or comparative summarization has been known for a long time and has attracted a great deal of interest from the research community. As in the evolution of language models, where rule-based systems were replaced by statistical ones, followed by a neural revolution [28], contrastive summarization systems have undergone a similar evolution: In the early days of this research, contrastive summarization was typically cast as an optimization problem, with representativeness, contrastivity, and diversity as the constraints for extractive sentence selection. With the rise of generative statistical models such as LDA, topic generation was very much in vogue for a time; Bayesian methods and expectation maximization were also used to detect contrastive themes in a text. Meanwhile, these have been replaced by machine learning approaches, in particular deep learning methods, with pre-trained language models increasingly forming the basis of contrastive summarization. Building on this, another trend has emerged that involves the use of additional complex data structures linking the documents: patents, scientific articles, or websites are connected by citations or hyperlinks in addition to raw text, enabling the use of graph-based methods. However, the development of contrastive summarization methods must take into account the genre of the text, as different genres pose different challenges. For example, patents and scientific articles tend to be very technical and formal, while tweets and forum comments often adopt a colloquial tone. This is acknowledged by [55], who recognize in their work that different techniques may be applied depending on the type of text.

Several efforts have also been made by researchers for evaluation purposes. However, the lack of standard data sets and the fact that the task of contrastive summarization differs on the basis of its application have led to the emergence of a multitude of data sets, most of which were created by researchers themselves. To further popularize the task and provide grounds for fair method comparison, competitions with standardized data sets are expected to emerge.

Standardized measures such as accuracy, recall, F-measure, and ROUGE, in addition to human-based Likert scales, were also used to evaluate contrastive summarization systems. These, however, are more appropriate for extractive approaches. In the future, human-based metrics or semantic distances computed with pre-trained language models [25] should be used when evaluating systems based on abstractive approaches, since evaluation metrics are needed that correctly score sentences that share the same semantic meaning but are expressed with synonyms.

Finally, there is a body of research work that incorporates temporal aspects in contrastive summarization, which was not covered in this survey. These are either research works that compare content published in different time periods [13, 14, 26] or that conduct comparative timeline summarization (Footnote 3) [15]. Temporal comparative summarization poses unique challenges, as it needs to consider issues such as event ordering and co-reference, changes in the general/background context of different times, semantic drift of the vocabulary, fragmentary or missing documents (especially when considering distant time periods), or even differing OCR quality. This is something researchers should look at in the future.

Parallel to this, we envision that in the near future, modern pre-trained language models, which have shown strong effectiveness on abstractive summarization benchmarks [36], will be extended to contrastive summarization, which has, with the exception of [25], always been treated as an extractive task.