1 Introduction

Argument mining (AM) has become an established field of research in Natural Language Processing (NLP), with numerous works published in recent years [8, 12, 16]. AM is used with growing success to automatically detect argumentative structures in textual discourse, including student essays [8] and web forums [11]. Argumentative structures that can be identified automatically include claims [6] and premises, argument relations [8], and pro- and con-arguments [22]. As such, AM can support decision making by retrieving the most important arguments for and against controversial matters.

This contribution details how we addressed the challenging task of argument search in heterogeneous data in the ArgumenText project. ArgumenText has pioneered the generalization of AM at the sentence level and created important resources for both argument classification and argument clustering. To achieve this goal, we had to overcome several research challenges:

  1. Generalizing AM to heterogeneous sources (e.g. news as well as web content): to extract relevant arguments from all available sources, we must ensure that the AM model can detect arguments in any type of text.

  2. Scaling AM technology to big data (e.g. millions of web pages): to work on large datasets or data streams, extraction must be as fast as in other information retrieval (IR) scenarios.

  3. Clustering similar arguments (if they refer to the same reasoning): to present long lists of arguments to users more clearly, similar and dissimilar arguments must be detected.

In the following, we explain these challenges in more depth and show how they were solved in the ArgumenText project.

Table 1 Argumentative search engines. Sources: the document collections from which arguments are extracted; Argument Classification: whether arguments are detected at query (online) or indexing (offline) time.

2 Argument-based Search

AM offers the perfect ground for combining machine learning with human decision making, as it aims to detect viewpoints (in the form of argumentative structures) using machine intelligence. Given a controversial topic (e.g. “wind energy”) and a large enough text collection to search in (e.g. a web crawl), the ideal AM system should be able to extract all relevant reasoning from previous debates about the topic of interest. For example, AM-supported decision making has been investigated in the context of evidence-based reasoning [19], where AM is used to detect and distinguish different kinds of evidence, with applications in the medical domain [14].

Given the subjective nature of evidence evaluation [1], initial applications of AM-supported decision making quickly converged on the creation of argumentative search engines [26]. Inspired by manually curated online debating portals such as kialo.com, procon.org or idebate.org, this line of research frames automatic AM as a retrieval task, aiming to maximize the relevance of search results with respect to the input query [17]. As opposed to standard web search engines, argumentative search engines need to detect the most relevant arguments given the query term(s) and the document collections to search in. Most approaches divide arguments into statements supporting (pro) or attacking (con) the input query, motivated by the goal of avoiding biased or one-sided retrieval [2]. Arguments themselves are typically defined as “text expressing evidence or reasoning that can be used to either support or oppose a given topic” [22], where the topic may be identical to or merely highly relevant to the input query. The first case, in which arguments are classified as such based on the query itself, can be referred to as online argument classification [2]; in the second case, arguments are classified independently of the query (i.e. offline). Table 1 lists recently proposed argumentative search engines; we only list search engines that divide arguments by (binary) stance (pro vs. con). Further AM-driven search engines such as MARGOT [13] and TARGER [5] provide online interfaces for argument tagging, i.e. argument component detection at the token level [8]. Fig. 1 shows the ArgumenText search engine results for the query “wind energy”.

Fig. 1

The first few hits for the search query “wind energy” as displayed by the argument search engine ArgumenText. ArgumenText ranks arguments by the confidence score of its argument extraction algorithm [21]

The ArgumenText search engine was created as part of our effort to demonstrate the applicability of AM-driven approaches to decision making, in particular to unrestricted and unstructured text collections. To date, ArgumenText is the only publicly available argumentative search engine retrieving English and German arguments in real time from completely uncurated web sources (see Table 1). Recent work in the context of the IBM Debater project [10] presents a similar system extracting arguments from news and Wikipedia articles; however, the authors only release the Wikipedia portion of the dataset and do not offer a public search engine. The methodological details of our proposed solution to this problem are described in Sect. 3. For the project goal of testing the usefulness of AM technologies for real-world applications (see Sect. 1), we needed to go beyond argument search and develop end-user applications. This resulted in two additional requirements: the technology needed to work on dynamic data streams (e.g. social media) and to cluster recurring arguments (in order to reveal and quantify reasoning strategies for given topics). These two challenges are detailed in Sects. 4 and 5.

3 Extracting Arguments from Heterogeneous Sources

Early work on AM in NLP research used highly structured argumentation schemes to parse argumentative discourse [16, 20]. These argumentation schemes make rather strong assumptions about the argumentative nature of the input documents they can be applied to; e.g. the claim-premise scheme proposed by [20] relates premises (evidence) to claims, which in turn refer to major claims. While it has been shown that such discourse-level approaches to AM can also be applied to web data [11], it remains doubtful whether they can be reliably applied to certain kinds of user-generated web content such as customer reviews [15]. Furthermore, training deep learning models requires large amounts of training data, which are much more difficult to collect for fine-grained hierarchical schemes such as the one proposed by [20]. We also found that the major claim, or even the claims themselves, are often not given explicitly but must be inferred from the context or with world knowledge. For example, an argument explicitly attacking coal energy may implicitly serve as a supporting argument for wind energy.
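To make the hierarchy of the claim-premise scheme concrete, the following minimal sketch models it as a data structure. The class and field names are our own illustrative choices, not part of the scheme definition in [20]:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Premise:
    text: str
    stance: str  # "support" or "attack" with respect to its claim

@dataclass
class Claim:
    text: str
    stance: str                                    # stance towards the major claim
    premises: List[Premise] = field(default_factory=list)

@dataclass
class MajorClaim:
    text: str
    claims: List[Claim] = field(default_factory=list)

# Annotating such nested structures requires trained experts, which is one
# reason why collecting large training sets for this scheme is difficult.
```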

As a remedy, [22] suggest information-seeking AM, which is “general enough for use on heterogeneous data sources, and simple enough to be applied manually by untrained annotators at a reasonable cost” [22]. The work shows that reliable annotation via crowdsourcing and automatic inference across eight topics are possible when a given controversial topic (e.g. “minimum wage”) is used to classify isolated sentences as non-, pro-, or con-arguments. The resulting dataset was released as part of the ArgumenText project. Training and inference are performed by a contextual BiLSTM architecture (“biclstm”), which integrates information about the topic into some of the LSTM gates, such that sentence and topic can be processed jointly. Another advantage of the simpler annotation scheme is that the training data, originally created on English sources, can be translated into other languages using state-of-the-art machine translation (as demonstrated for German by [23]). The translated data can then be used to directly train a model in the target language, which has been recognized as a very efficient way to create cross-lingual AM models [9].
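The following PyTorch sketch illustrates the core idea of such a topic-integrating LSTM cell. It is a simplified reconstruction, not the original biclstm implementation of [22]; which gates receive the topic signal, and all dimensions, are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ContextualLSTMCell(nn.Module):
    """Sketch of a contextual LSTM cell: a fixed topic vector is fed into
    (some of) the gates, so sentence and topic are processed jointly."""

    def __init__(self, input_dim: int, topic_dim: int, hidden_dim: int):
        super().__init__()
        # standard input/hidden projections for the four LSTM gates
        self.x2h = nn.Linear(input_dim, 4 * hidden_dim)
        self.h2h = nn.Linear(hidden_dim, 4 * hidden_dim)
        # extra projection feeding the topic into the input and output gates
        self.t2h = nn.Linear(topic_dim, 2 * hidden_dim)

    def forward(self, x, topic, state):
        h, c = state
        gates = self.x2h(x) + self.h2h(h)
        i, f, g, o = gates.chunk(4, dim=-1)
        t_i, t_o = self.t2h(topic).chunk(2, dim=-1)
        i = torch.sigmoid(i + t_i)       # topic-aware input gate
        f = torch.sigmoid(f)
        g = torch.tanh(g)
        o = torch.sigmoid(o + t_o)       # topic-aware output gate
        c_new = f * c + i * g
        h_new = o * torch.tanh(c_new)
        return h_new, c_new
```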

Our later work on argument classification [18] shows that the biclstm approach of [22] is largely outperformed by a transformer-based architecture using contextualized BERT-large embeddings [7]. In [21], we showed that training on a larger set of topics further improves the classification of sentences into non-, pro-, or con-arguments. We further showed that this kind of argument classification can also be performed at the word level, making it possible to decompose sentence-level arguments into more fine-grained units [24]. This approach requires token-level annotations for training a sequence labeling method, which we also release as part of the ArgumenText project.
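A minimal sketch of topic-dependent sentence classification with a transformer, in the spirit of [18, 21]: topic and sentence are encoded as a sentence pair and mapped to the three classes. The checkpoint name and label order are placeholders; in practice, a model fine-tuned on the data described above is required:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-large-uncased"  # placeholder; load a fine-tuned checkpoint instead
LABELS = ["non-argument", "pro-argument", "con-argument"]  # assumed label order

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

def classify(topic: str, sentence: str):
    # the topic and the candidate sentence are processed jointly as one pair
    inputs = tokenizer(topic, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    conf, idx = probs.max(dim=0)
    return LABELS[idx.item()], conf.item()  # label plus classifier confidence
```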

Fig. 2

Overview of the ArgumenText service infrastructure. The document storage (left) can process and store content from static or dynamically growing document collections. The core components (middle) are responsible for argument processing and storage. Two graphical interfaces allow users to interact with the system (right)

For the public version of the ArgumenText search engine, we indexed more than 400 million English and German web pages from the CommonCrawl project and segmented all documents into sentences [21]. For English and German queries, the system first retrieves a limited number of relevant documents ranked by BM25 score, and then classifies all sentences from these documents with the classifier described above. Only sentences identified as pro- or con-arguments are displayed, ranked by classifier confidence. Using this two-stage approach to argument search in heterogeneous sources, the ArgumenText system reaches a coverage as high as 89% when comparing top-ranked search results to expert-curated lists [21].
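A minimal sketch of this two-stage pipeline, assuming the rank-bm25 package and a topic-dependent classifier such as the `classify` sketch above (passed in as `classify_fn`); the naive sentence splitting and the cut-off of 100 documents are illustrative:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def search_arguments(query, documents, classify_fn, top_docs=100):
    """Two-stage argument search: BM25 document retrieval followed by
    sentence-level argument classification."""
    tokenized = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

    results = []
    for i in ranked[:top_docs]:                      # stage 1: BM25 retrieval
        # naive sentence segmentation; a real system uses a proper segmenter
        for sentence in documents[i].split(". "):
            label, conf = classify_fn(query, sentence)   # stage 2: classification
            if label in ("pro-argument", "con-argument"):
                results.append((conf, label, sentence))
    return sorted(results, reverse=True)  # rank by classifier confidence
```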

4 Scaling AM to Big Data

The ArgumenText search engine described in Sect. 3 extracts arguments from a static web crawl. To validate the technology beyond generic argument search, we built a service-oriented infrastructure around the core components. In particular, we wanted to be able to extract arguments from any given source, including arbitrary document collections specified by end users. For that purpose, we decoupled argument classification from document retrieval and wrapped it as a service available via REST APIs. This service accepts arbitrary textual input and, given a topic which is used to decide on the argumentativeness of the sentences, returns sentence-level arguments from that input.
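A minimal sketch of such a classification service, here using FastAPI; the endpoint name and payload fields are our own assumptions and do not reflect the actual ArgumenText API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ClassificationRequest(BaseModel):
    topic: str            # used to decide on the argumentativeness of sentences
    sentences: list[str]  # arbitrary textual input, already segmented

@app.post("/classify-arguments")
def classify_arguments(req: ClassificationRequest):
    results = []
    for sentence in req.sentences:
        # `classify` is the topic-dependent model sketched in Sect. 3
        label, conf = classify(req.topic, sentence)
        results.append({"sentence": sentence, "label": label, "confidence": conf})
    return {"topic": req.topic, "arguments": results}
```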

Since direct queries to the REST APIs can only process a limited number of documents (to prevent timeouts), we connected the argument classification API to a queuing component which handles query monitoring and execution in the background. The queuing component is connected to a graphical frontend which records the search queries of registered users and periodically pulls novel arguments from the queue. The overall infrastructure is illustrated in Fig. 2. Fig. 3 shows the output of the graphical frontend for the query “e-scooter”, as extracted from a web crawl.
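The queuing logic can be sketched as follows, with an in-process queue for illustration (a production system would use a persistent job queue); `search_arguments` and `classify` refer to the sketches above, and the job format is hypothetical:

```python
import queue
import threading

jobs = queue.Queue()  # pending search queries
finished = {}         # job id -> extracted arguments

def worker():
    while True:
        job = jobs.get()  # blocks until a query is queued
        args = search_arguments(job["query"], job["documents"], classify)
        finished[job["id"]] = args  # the frontend polls this store periodically
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put({"id": "q1", "query": "e-scooter", "documents": ["..."]})
```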

Fig. 3

Excerpt from the ArgumenText dashboard. The argument graph for the topic “e-scooter” reveals an initially positive trend in June 2019, which turned negative in later months. Green and red bars indicate the number of pro and con arguments on the time axis

Fig. 4

Word clouds and example arguments for three exemplary clusters for the topic “abortion”. a “Fetuses are incapable of feeling pain when most abortions are performed.” b “Abortion is the killing of a human being, which defies the word of God.” c “Allowing abortion conflicts with the unalienable right to life recognized by the Founding Fathers of the United States.”

5 Argument Clustering

Arguments retrieved from multiple sources, as in the scenarios described above, often repeat similar reasoning. For example, on the topic of “nuclear energy”, arguments referring to the problem of radioactive waste (an argumentative aspect) can be phrased in many ways. While it can be insightful to compare multiple instances of arguments sharing the same argumentative aspect, smart AM-based decision-support systems should provide end users with argument clusters rather than unsorted lists of arguments. Multiple lines of research have addressed this problem, including unsupervised learning of semantic similarities between arguments [3, 27].

However, as we have shown in [18], unsupervised methods are outperformed by supervised methods on the task of argument similarity assessment. Unsupervised methods rely on the semantic overlap between pairs of arguments, which is not discriminative for arguments that already discuss the same topic. Instead, we propose to train dedicated argument similarity models that provide similarity scores for the clustering step. For this purpose, we released a corpus of sentence-level argument pairs extracted from heterogeneous web sources across 28 topics (the ASPECT corpus). The pairs were annotated with three degrees of similarity, according to their overlap with regard to the argumentative aspect they address. Following the experiments described in [18], we only distinguish between related and unrelated arguments, which makes it possible to evaluate similarity prediction methods with F1 scores. The best supervised model (a fine-tuned BERT-base) performs almost 10 percentage points better than an unsupervised model based on BERT embeddings. Using agglomerative hierarchical clustering with a stopping threshold, we are able to aggregate all arguments retrieved for a topic into clusters of aspects. Fig. 4 visualizes three example clusters produced with this procedure.
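A minimal sketch of the clustering step with scikit-learn, assuming a pairwise `similarity` function (a stand-in for the fine-tuned BERT similarity model of [18]) and an illustrative stopping threshold:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_arguments(arguments, similarity, threshold=0.5):
    """Agglomerative clustering over 1 - similarity, stopped by a distance
    threshold instead of a fixed number of clusters."""
    n = len(arguments)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - similarity(arguments[i], arguments[j])
    clustering = AgglomerativeClustering(
        n_clusters=None,               # let the stopping threshold decide
        distance_threshold=threshold,
        metric="precomputed",          # scikit-learn < 1.2 uses affinity= instead
        linkage="average",
    )
    return clustering.fit_predict(dist)  # cluster id per argument
```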

6 Applications

We identified two promising applications for AM in supporting decisions: innovation assessment and advanced customer feedback analysis.

Technology and Innovation Assessment: Innovative technology often goes along with overly positive reasoning (“hype”) at an early stage, making it difficult to identify potential risks. AM-based decision support can help resolve this dilemma, as it seeks to retrieve a balanced representation of supporting and attacking arguments on early or more mature innovative technologies. When applied to real-time news collections reporting on innovation and technology (e.g. online magazines), AM can help in making smarter investment decisions. Furthermore, novel trending aspects can be detected and quantified early on, using a combination of the technologies described in Sects. 4 and 5.

Advanced Customer Feedback Analysis: Companies with a broad product range in the consumer sector are often unable to accurately evaluate the large amount of customer feedback on different products from multiple channels. Existing automatic methods for analyzing customer feedback rely on sentiment mining or unsupervised methods (clustering). While sentiment analysis may be able to separate positive from negative feedback or to distinguish degrees of criticality, it cannot reveal the reasons behind the feedback, which would be helpful for product development. The AM technologies explained in Sects. 4 and 5 can thus be used to discover and quantify problematic aspects of existing products, in order to increase product-market fit and decrease time to market.

7 Future Directions

We presented challenges and solutions for AM-based decision support in the context of the ArgumenText project. Open challenges that remain include:

  (a) Sorting arguments by quality: current argument search engines rank arguments by classifier confidence or by IR-based ranking functions. However, end users might prefer arguments of high quality [25] over arguments that are merely highly relevant to the search query.

  (b) End-to-end argument clustering evaluation: a large-scale benchmark dataset that contains sentence-level arguments for multiple topics and further groups them into subtopics is urgently needed.

  (c) Labeling argument clusters: interpreting clusters is a difficult task that can be approximated by listing predominant words (e.g. using LDA) or by word frequency clouds; a minimal sketch follows this list. However, to clearly identify and label argument clusters, dedicated methodologies for extracting aspect identifiers are required.
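As a simple illustration of the word-list approximation mentioned in (c), the following sketch labels a cluster by its highest-weighted TF-IDF terms; this is one possible heuristic, not a solution to the labeling problem:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words(cluster_sentences, k=5):
    """Return the k predominant words of one argument cluster."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(cluster_sentences)
    weights = tfidf.sum(axis=0).A1        # aggregate TF-IDF weight per term
    terms = vec.get_feature_names_out()
    return [terms[i] for i in weights.argsort()[::-1][:k]]
```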