In this section, we give a brief overview of the results that we have achieved since we started working on the modules of the envisioned CAM system in late 2017.
Comparative Argumentative Machine (CAM)
We have developed a prototype of the CAM system that can be accessed online. The system takes two target objects and an optional list of comparison aspects as input (i.e., no natural language question, yet) and then retrieves sentences supporting either of the objects with respect to the given aspects but also with respect to further automatically identified ones (e.g., “Python is better than PHP for web development.”). The answer is presented in the form of the retrieved supporting sentences for the two objects and an overall “score” showing which object is favored in the retrieved sentences.
The CAM system has the following components.
Sentence retrieval: the input query (objects and aspects) is run against an Elasticsearch index of the Common Crawl-based DepCC corpus (14.3 billion linguistically pre-processed English sentences).
Sentence classification: a classifier maps each retrieved sentence to one of four classes: the first object from the user input is better/equal/worse than the second one, or no comparison is found.
Sentence ranking: the retrieved sentences are re-ordered by the descending product of classification confidence and Elasticsearch retrieval score.
Aspect identification: up to ten additional aspects are automatically identified, even when no comparison aspects are provided by the user, by searching for (phrases with) comparative adjectives/adverbs and for hand-crafted patterns like “because of higher …” or “reason for this is …” (a combined sketch of these pipeline steps follows this component list).
User interface: keyword boxes as the input form and an answer presentation component (cf. Fig. 1).
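To make the interplay of these components concrete, the following is a minimal Python sketch of the retrieval, classification, ranking, and aspect-identification steps. The index name, the keyword-heuristic stand-in classifier, and the cue patterns are illustrative assumptions, not the actual CAM implementation (which uses a trained four-class classifier as described above).

    import re
    from elasticsearch import Elasticsearch  # assumes an elasticsearch-py 8.x client

    es = Elasticsearch()
    INDEX = "depcc-sentences"  # hypothetical name of the DepCC sentence index

    # Illustrative cue patterns for additional aspects (hand-crafted, as above).
    ASPECT_PATTERNS = [
        re.compile(r"because of (?:higher|lower|better) (\w+)"),
        re.compile(r"reason for this is (?:the |its )?(\w+)"),
    ]

    def retrieve(obj_a, obj_b, aspects, k=100):
        """Query the sentence index for sentences matching objects and aspects."""
        query = " ".join([obj_a, obj_b] + aspects)
        hits = es.search(index=INDEX, query={"match": {"text": query}}, size=k)
        return [(h["_source"]["text"], h["_score"]) for h in hits["hits"]["hits"]]

    def classify(sentence, obj_a, obj_b):
        """Stand-in for the trained sentence classifier; returns one of the four
        classes (BETTER/WORSE/EQUAL/NONE for the first object) and a confidence."""
        s = sentence.lower()
        if obj_a.lower() in s and obj_b.lower() in s:
            if "better" in s:
                return "BETTER", 0.9
            if "worse" in s:
                return "WORSE", 0.9
            return "EQUAL", 0.5
        return "NONE", 0.0

    def rank(candidates, obj_a, obj_b):
        """Re-order sentences by the product of classification confidence
        and Elasticsearch retrieval score."""
        scored = []
        for text, es_score in candidates:
            label, confidence = classify(text, obj_a, obj_b)
            if label != "NONE":
                scored.append((confidence * es_score, label, text))
        return sorted(scored, reverse=True)

    def extract_aspects(sentences, limit=10):
        """Collect up to `limit` additional aspects from the cue patterns."""
        aspects = []
        for text, _ in sentences:
            for pattern in ASPECT_PATTERNS:
                aspects.extend(pattern.findall(text))
        return list(dict.fromkeys(aspects))[:limit]  # dedupe, keep order

For instance, rank(retrieve("python", "php", ["web development"]), "python", "php") would produce a ranked list of supporting sentences for the two objects, from which the overall score and the answer presentation are derived.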
We compared the CAM prototype to a “classical” keyword-based search system in a user study that asked participants to answer comparative questions. The results showed that the CAM users were 15% more accurate in finding correct answers and about 20% faster (for more details, see our respective paper).
In the current CAM prototype, the sentence classifier is pre-trained on sentences from only three domains: computer science, brands, and misc (books, sports, animals, etc.). Further diversifying the training domains is thus one idea to improve the prototype; another rather “obvious” step is to accept natural language questions as input instead of requiring the objects and aspects in separate fields. Finally, an important direction for future improvements is the identification of answer sentences that are more argumentative, and a “real” summarization of the answer as one coherent and concise text fragment. We have already taken first steps in these directions, which are presented in the next sections.
Argument Mining and Retrieval with TARGER
To identify more “argumentative” sentences (or even documents) for the CAM answer, we have developed TARGER: a neural argument tagger that comes with a web interface and a RESTful API. The tool can tag arguments in free text input (cf. Fig. 2) and can retrieve arguments from the DepCC corpus that is also used in the CAM prototype (cf. Fig. 3). TARGER is based on a BiLSTM-CNN-CRF neural tagger pre-trained on the persuasive essays (Essays), web discourse (WebD), or IBM Debater (IBM) datasets and identifies argument components in text, classifying them as claims or premises. Using TARGER’s web interface or API, researchers and practitioners can thus apply state-of-the-art argument mining without any reproducibility effort (for more details on the implementation and effectiveness, see our respective paper).
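A minimal sketch of calling such a tagging API from Python is shown below; the endpoint URL and the exact response shape are placeholders, not TARGER’s actual routes (consult the TARGER documentation for those).

    import requests

    # Placeholder endpoint; the real API exposes routes per pre-trained model.
    TARGER_API = "https://example.org/targer/classify"

    text = "Python is better than PHP because its syntax is more readable."
    response = requests.post(TARGER_API, json={"text": text})
    response.raise_for_status()

    # One plausible response shape: token/label pairs with BIO-style tags
    # distinguishing claims from premises (e.g., C-B, C-I, P-B, P-I, O).
    for token in response.json():
        print(token)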
Re-Ranking with Argumentativeness Axioms
To examine the effect of argumentativeness on search, we have experimented with re-ranking results based on their argumentativeness and credibility, which are captured via respective preference-inducing axioms (i.e., retrieval constraints for pairs of documents). The argumentativeness axioms use TARGER to tag arguments as premises and claims and then re-rank the top-50 BM25F results with respect to several facets of argumentativeness (e.g., which document contains more argumentative units close to the query terms). We tested the axiomatic re-ranking with a focus on argumentativeness in the TREC 2018 Common Core track and also in the TREC 2019 Decision track, where we also added credibility axioms. The results show encouraging improvements for some of the TREC topics that we manually identified as potentially “argumentative”, while the generalizability to more topics needs further investigation (for more details on the axioms and results, see our respective TREC reports [3, 4]).
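As an illustration, the sketch below shows one way such a preference-inducing axiom could be implemented on top of TARGER’s token labels. The proximity window, the single axiom, and the aggregation of pairwise votes are simplified assumptions, not the exact axioms from our TREC runs.

    def arg_units_near_query(doc_tokens, doc_labels, query_terms, window=10):
        """Count claim/premise tokens (label != 'O') that occur within
        `window` tokens of a query term occurrence. Illustrative only."""
        positions = [i for i, t in enumerate(doc_tokens) if t.lower() in query_terms]
        return sum(1 for i, label in enumerate(doc_labels)
                   if label != "O" and any(abs(i - p) <= window for p in positions))

    def axiom_preference(doc_a, doc_b, query_terms):
        """Pairwise preference: +1 if doc_a is more argumentative near the
        query terms, -1 if doc_b is, 0 if the axiom has no preference."""
        a = arg_units_near_query(doc_a["tokens"], doc_a["labels"], query_terms)
        b = arg_units_near_query(doc_b["tokens"], doc_b["labels"], query_terms)
        return (a > b) - (a < b)

    def rerank_top_k(ranking, query_terms, k=50):
        """Re-order the top-k BM25F results by accumulated axiom preferences;
        the stable sort keeps the original order as a tie-breaker."""
        top = ranking[:k]
        votes = {d["id"]: 0 for d in top}
        for i, doc_a in enumerate(top):
            for doc_b in top[i + 1:]:
                pref = axiom_preference(doc_a, doc_b, query_terms)
                votes[doc_a["id"]] += pref
                votes[doc_b["id"]] -= pref
        return sorted(top, key=lambda d: -votes[d["id"]]) + ranking[k:]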
Identifying Comparative Questions
As a first step towards allowing questions as inputs to the CAM prototype, we have studied real comparative questions submitted as queries to the Russian search engine Yandex or posted on the Russian community question answering platform Otvety. We have manually annotated a sample of 50,000 Yandex questions and 12,500 Otvety questions as comparative or not. The comparative questions were further tagged with ten fine-grained labels (e.g., whether the question asks for a fact or arguments) to form a taxonomy of the different comparison intents.
To identify comparative questions, we trained a classifier that recalls 60% of the comparative questions with perfect precision; we also trained separate classifiers for the fine-grained subclasses. A qualitative analysis after running the classifiers on a one-year Yandex log of about 1.5 billion questions showed that about 2.8% of the questions are comparative (about one per second, with seasonal effects like mushroom comparisons in fall). The majority of the comparison intents cannot be answered by retrieving similar questions from a question answering platform and go far beyond just comparing products or asking for simple facts. A search engine that wants to answer comparative questions in their entirety—like our envisioned CAM system—can thus not rely solely on a knowledge graph or on online question answering platforms (for more details, see our respective paper).
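The general idea of trading recall for precision with a confidence threshold can be sketched as follows; the features, model, tiny training sample, and threshold value are illustrative assumptions, not the actual classifier from our study.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: 1 = comparative question, 0 = not comparative.
    questions = ["what is better python or php", "how old is the eiffel tower"]
    labels = [1, 0]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(questions, labels)

    # Only accept highly confident predictions; in practice the threshold
    # would be tuned on held-out data until precision is (near-)perfect,
    # accepting that recall drops (to the reported ~60% in our case).
    THRESHOLD = 0.95
    probs = model.predict_proba(["is java faster than c++"])[:, 1]
    print(probs >= THRESHOLD)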
Touché: Shared Task on Argument Retrieval
To foster and consolidate the research community dealing with argument search and retrieval, we are organizing the Touché lab at CLEF 2020, the first shared task on argument retrieval. The Touché lab has two subtasks: (1) the retrieval of arguments from a focused debate collection to support argumentative conversations, and (2) the retrieval of argumentative documents from a generic web crawl to answer comparative questions with argumentative results.
In the first subtask, we address the scenario of users who directly search for arguments on controversial or socially important topics (e.g., to support their stance or to form one), while in the second subtask we address personal decisions from everyday life in the form of comparative information needs (e.g., “Is X better than Y for Z?”, similar to our CAM prototype). For the first subtask, we provide a dataset of more than 380,000 short argumentative text passages crawled from online debate portals, and the task of the lab participants is to retrieve relevant arguments for 50 given topics that cover a wide range of controversial issues. For the second subtask, the dataset is the ClueWeb12, and the task of the lab participants is to retrieve documents that help answer 50 comparative questions given as topics.
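For orientation, a participant-style baseline for either subtask could look like the following sketch, which produces output in the standard TREC run file format; the index name, field name, topic list, and run tag are placeholders, not part of the official lab setup.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    topics = [(1, "Is X better than Y for Z?")]  # placeholder topic list

    for number, title in topics:
        hits = es.search(index="touche-collection", size=100,
                         query={"match": {"text": title}})
        for rank, hit in enumerate(hits["hits"]["hits"], start=1):
            # TREC run file format: topic Q0 doc rank score tag
            print(number, "Q0", hit["_id"], rank, hit["_score"], "baselineBM25")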