1 Introduction

This paper addresses a central question within research on argumentation, namely: What makes a good argument? [29, 41, 43,44,45]. The literature so far has established that the quality of an argument has many dimensions, which pertain to the content of the arguments themselves as well as to their rhetorical “packaging”. In our project Visual Analytics and Linguistics for Capturing, Understanding, and Explaining Personalized Argument Quality (CUEPAQ), we have built on our expertise in the linguistic analysis of argumentation [11, 14, 38] to explore the hypothesis that argument preferences are, in fact, often more subjective than the current state of the art in the literature leads us to believe (cf. [41]). More concretely, the project focuses on the effect of linguistic features on personalized argument preferences.

For this, we have developed a new application, the CUEPipe. This pipeline allows researchers to generate data sets for assessing personalized argument preferences as well as annotating these data sets for argument preference. Expecting different results from different annotators, we also provide a platform for exploring personalized argument models learned from the annotations. Thus, the CUEPipe allows linguists to investigate argument preferences, including our claim that argument preferences are, to some extent, subjective. In this paper, we describe three major components of the application:

  i. An interface for generating a corpus of arguments and exploring its linguistic feature diversity
  ii. An interface for labeling pairwise comparisons between arguments
  iii. An interface for exploring personal argument preferences

We illustrate each of these steps based on a proof-of-concept use case by reporting our own experiences with the application and the results of a user study tailored towards testing the visual interactive labeling aspect of the application and the exploration of personal argument preferences. Our declared goal was to explore whether the way we attribute beliefs to different entities affects how we perceive the corresponding arguments. We do this by looking at how propositional attitude verbs affect argument preferences.

The paper is structured as follows: In the next section, we describe the concepts explored in CUEPAQ in more detail. In Sect. 3, we describe how these concepts relate to the CUEPipe, and in Sect. 4, we describe our pilot use case involving user studies. Section 5 discusses limitations, and Sect. 6 concludes.

2 Background

One main goal of our research is to investigate the impact of linguistic features on argument preferences in a controlled manner. To achieve this, we drastically simplify the complexity often attributed to the structure of arguments, as becomes apparent when investigating the topic of argumentation schemes (e.g., [23, 31, 46, 47]). As such, we rely on the simple idea that “Argumentation is aimed at increasing (or decreasing) the acceptability of a controversial standpoint” [42, p. 4].Footnote 1 In the next section, we motivate this decision.

2.1 Argument Data

We treat arguments as tuples (premise, conclusion, relation), following basic (computational) argumentation schemes [30]. The premise and the conclusion are unmodified linguistic expressions.Footnote 2 They stand in a specified relation, which is taken to be a member of the set {support, attack}. The support relation indicates that the premise increases the acceptability of the conclusion, while the attack relation indicates that it decreases the acceptability of the conclusion. An example is illustrated in (1).

figure a
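For concreteness, the following is a minimal sketch of how such an argument tuple could be represented in code. The class and the example content (adapted from the piracy example in Sect. 4) are our own illustration, not the CUEPipe's internal data model.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Argument:
    """An argument as a (premise, conclusion, relation) tuple."""
    premise: str
    conclusion: str
    relation: Literal["support", "attack"]

# Illustrative instance, adapted from the example discussed in Sect. 4.
arg = Argument(
    premise="3 out of 50 lawyers believe that piracy is theft.",
    conclusion="Piracy is theft.",
    relation="support",
)
```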

One of the reasons to focus on such simple arguments is to enable the contrastive study of linguistic features by means of minimal pairs. Minimal pairs originated in the linguistic study of sounds, where they help determine distinctive classes, for example, the phonemes of a language. Beyond phonology, the concept has been applied to other kinds of minimal pairs, prominently syntactic minimal pairs, which have been used, for example, in language acquisition research [10, 20]. Similarly, minimal pair data has been used to judge the linguistic abilities of machine and deep learning systems (see, e.g., [27, 48]). Our goal is to investigate whether minimal changes affect the judgment of argument preferences. More concretely, our corpus helps explore how minimal changes in the premise affect the acceptability of the conclusion of an argument. A typical minimal pair in our corpus is exemplified by (1) vs. (2). As can be seen, our minimal pairs are based on the choice of lexical items that make up an argument. The term minimal refers to the addition, removal, or change of at most one word.

figure b
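One possible way to operationalize this notion of minimality is a token-level check that two premises differ by the addition, removal, or change of at most one word. The sketch below is our own illustration, not the corpus tooling itself.

```python
from difflib import SequenceMatcher

def is_minimal_pair(premise_a: str, premise_b: str) -> bool:
    """True if the two premises differ by at most one added, removed,
    or changed token. Identical strings trivially pass the check."""
    a, b = premise_a.split(), premise_b.split()
    edits = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":
            continue
        # Count each changed block by the larger of its two sides,
        # so a one-word substitution counts as a single edit.
        edits += max(i2 - i1, j2 - j1)
    return edits <= 1

print(is_minimal_pair(
    "3 out of 50 lawyers believe that piracy is theft",
    "3 out of 50 lawyers discover that piracy is theft"))  # True
```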

As discussed in Sects. 3 and 4, these minimal pairs help provide balanced corpora for research into individual preferences as tied to linguistic features.

2.2 Argument Preferences

According to recent work on the assessment of argument quality, argument preferences are affected by various dimensions [43, 44]. However, these dimensions have mainly been used to assess objective argument quality. [41] acknowledge the subjectivity of rating argument quality but do not explore it further. Our approach builds on the efforts by [16, 36, 37] and similar work to evaluate how convincing arguments are, and we treat argument preference as a single value derived from a function over the linguistic feature values of an argument.

We collect pairwise comparisons of arguments to train models that learn this function (e.g., [16]). More concretely, our approach is based on [36] (see also [16, 43]). This means we train argument preference models based on Gaussian Process Preference Learning (GPPL). We chose this model since it is particularly well suited to working with sparse data. Furthermore, [37] opens up new possibilities for future research by simultaneously including annotations from multiple users. As we are primarily interested in the impact of linguistic features on model performance, we focus on linguistic features for assessing argument preferences (e.g., [5, 29, 44]) when training these models.

Based on the collected argument preferences and the models trained on them, we can develop user profiles that explain the linguistic preferences of users. For this, two strategies are pursued in this paper: i) analyzing the feature importance scores that a model assigns during training, and ii) analyzing the most and least preferred arguments of a user using register analysis methods [4].

2.3 Visual Analytics for Linguistics

We integrate diverse methodologies from the domain of Visual Analytics [24] to support argument and model exploration as well as user engagement in the procedural stages of the CUEPipe. We draw upon expertise from prior studies in the field of natural language exploration [5, 11]. Specifically, we derive methodologies from the field of visual data collection [12, 25, 33] to support the process of corpus annotation. We further integrate a new visual interactive labeling component derived from [2, 3, 34] for annotating argument preferences. Finally, we introduce a dashboard designed for the examination of preference models by introducing a new radial evaluation technique based on former approaches to user-centric visualization [8, 18, 34, 35], thus adding to the growing body of work on LingVis: Visual Analytics for Linguistics [1, 7].

3 The CUEPAQ Argument Exploration Pipeline

Our CUEPipe is a web-based application providing graphical user interfaces for various tasks related to the linguistic modeling of argument preferences. In this section, after introducing the overall workflow, we present the individual components, describing their basic functionalities and intended applications.

Fig. 1.
figure 1

The CUEPipe workflow

3.1 The CUEPipe Workflow

Figure 1 shows the overall workflow of the system. CUEPipe provides various interfaces (I) for collecting argument data. The (V)isualizations described in Sect. 3.2 provide an intuitive overview of the data set, allowing for its exploration. As described in Sects. 3.3 and 3.4, the labeling process and the exploration of argument preferences are also supported by separate interfaces and visualizations.

Furthermore, the workflow in Fig. 1 highlights the different roles of entities interacting with the CUEPipe. It provides access to an extendable argument corpus. However, it is best used to study specific linguistic cues in a targeted data set. Thus, the first important role is that of the linguist. The linguist formulates a hypothesis and defines an expected outcome of the study. Then they generate a data set accordingly. Correspondingly, they may choose to specify a feature set that focuses on the attributes of interest.

The next step is conducting the study. The second role, that of the users, comprises the target group. Here, the subjective nature of argument preferences comes into play. The user group of a study can be categorized along different dimensions, e.g., demographic features such as age, gender, or income, depending on the goal of the study and the corresponding hypothesis. The task of the user group is to compare arguments pairwise in order to create a model that captures their argument preferences reasonably well, as described in Sect. 3.3.

Finally, the role of the analyst is to interpret the resulting preference models and the insights they provide on the user group, e.g., finding clusters. The analyst has a dual role, as they should inform both the linguist and the users. Concerning the users, the goal is to teach them about their argument preferences by analyzing the features that play a role in their preference models and by comparing their models with others. With respect to the linguist, the analysis needs to communicate the actual outcome of the study, including information about model performance and other factors that might affect the reliability of the study. This forms a feedback loop: depending on the study’s outcome, the linguist may want to revise their hypothesis or tweak other variables, such as the feature space used or the argument set. If the result confirms the hypothesis, the linguist still needs to evaluate the created models carefully to ascertain that the results are reliable.

The best use for the CUEPipe may be for prototyping studies to make sure that a more detailed investigation is warranted. However, it also allows linguists to expand on a study incrementally. In principle, the different elements are modular, allowing for individual use, too.

Table 1. Argument distribution in the CAP

In the next few sections, we present the individual steps in Fig. 1 (collection, labeling, and analysis) in more detail, focusing on their implementation.

3.2 Generating a Data Set for Exploring Argument Preferences

The CUEPipe provides a graphical user interface for adding arguments to the Comparable Argument Corpus (CAP) we have developed. The main innovation of the CAP is that it allows adding minimal variations of arguments that contain contrasting lexical items. Thus, the interface is designed to provide a view for adding arguments, a view for varying arguments, and a general argument view that groups arguments and their variations to provide a high-level overview.

Data Collection: The corpus is divided into three levels: new arguments, staging arguments, and corpus. This distinction exists mainly for quality control. Arguments, as well as their variations, must adhere to the general structure described in Sect. 2.1, (premise, conclusion, relation); they must be linguistically adequate (i.e., no nonsensical strings, etc.); and the relation between premise and conclusion must be conceivable (thus, all arguments are assumed to surpass a certain argument quality threshold). After submission to new arguments, two additional data collectors have to confirm these requirements by promoting arguments to staging and corpus, respectively. Consequently, three distinct experts confirm each argument as suitable for the corpus.Footnote 3 Table 1 describes the current size of the corpus. Variation ratio refers to the average number of variations per argument. Unique standpoints refers to the number of unique conclusions, indicating the topic variation in the corpus. Since the goal is to focus on the effects of linguistic features on argument preferences, we aim to provide a varied data set that allows the creation of test sets for various topics.
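The following sketch illustrates this three-stage promotion logic. The level names follow the text, but the class, method names, and reviewer tracking are our own assumptions rather than the CUEPipe implementation.

```python
from enum import Enum

class Level(Enum):
    NEW = 0
    STAGING = 1
    CORPUS = 2

class CorpusEntry:
    """An argument moving through the quality-control stages."""

    def __init__(self, argument, submitted_by):
        self.argument = argument
        self.level = Level.NEW
        self.reviewers = {submitted_by}  # everyone who has confirmed it so far

    def promote(self, reviewer):
        """Advance one level; each promotion needs a new, distinct expert,
        so an argument in the corpus has been confirmed by three people."""
        if reviewer in self.reviewers:
            raise ValueError("Each stage must be confirmed by a different expert.")
        if self.level is Level.CORPUS:
            raise ValueError("Argument is already in the corpus.")
        self.reviewers.add(reviewer)
        self.level = Level(self.level.value + 1)
```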

Each argument is annotated with metadata, including relations between arguments (i.e., whether an argument is a variation of another or an original argument), an author label for variations, and a source label for original arguments.Footnote 4 To keep the structure simple, variations are not nested in the corpus: a variation has no variations of its own (although it may incidentally constitute a variation of another variation of the original).

Linguistic Feature Annotations: Each argument in the main corpus is annotated with linguistic features to allow for the exploration of personalized argument preferences. For this, we use several automated feature annotation pipelines. Some of these were borrowed from other work, e.g., [11, 39] and [29], while other features have been implemented specifically for the CUEPipe. Particularly relevant for the CUEPipe are features introduced by lexical items, including the concrete use of certain items and their additional properties. Examples are embedding verbs, noun and verb modifiers, and different types of negation (verb vs. noun). As an example of additional annotations related to concrete lexical items, we use the semantic parser by [21] to distinguish different kinds of intensionality (veridical, averidical, and anti-veridical). Overall, the application supports 66 linguistic features, ranging from stylistic to semantic. These features are organized into feature groups that give an intuitive understanding of their expected role in analyzing personal argument preferences.
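To give an impression of what such automated annotations can look like, here is a sketch of a feature extractor for a few of the lexical features mentioned above. The choice of spaCy with the en_core_web_sm model, the small verb lexicon, and the feature names are our own assumptions for illustration; they are not the CUEPipe pipelines, which among other things rely on the semantic parser by [21].

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Tiny illustrative lexicon; the real pipelines cover a much richer
# inventory (66 features in total, from stylistic to semantic).
EMBEDDING_VERBS = {"believe", "think", "claim", "discover", "show", "agree"}

def annotate(argument_text: str) -> dict:
    """Extract a handful of lexical/syntactic features from an argument."""
    doc = nlp(argument_text)
    return {
        "embedding_verb_count": sum(
            t.lemma_ in EMBEDDING_VERBS and t.pos_ == "VERB" for t in doc),
        "verb_negation_count": sum(            # negation attached to a verb
            t.dep_ == "neg" and t.head.pos_ == "VERB" for t in doc),
        "noun_modifier_count": sum(            # adjectival modifiers of nouns
            t.dep_ == "amod" for t in doc),
        "token_count": len(doc),
    }

features = annotate("3 out of 50 lawyers believe that piracy is theft.")
```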

Corpus Exploration: In addition to the corpus management functionality, we provide a visual exploration dashboard to interact with the data in the corpus. This component primarily serves to inspect feature distributions and interactions in the corpus. It consists of three parts: the argument similarity map and the global and local co-occurrence matrices.

Fig. 2.
figure 2

Feature exploration

The argument similarity map, as the name suggests, maps arguments onto a two-dimensional space as circles, distributing them according to their similarity based on their annotated features. For this, we use an off-the-shelf dimensionality reduction (principal component analysis, PCA; [17]) to reduce the linguistic feature vectors to two dimensions. The map can be customized for selected feature combinations, so that the distributions of different feature categories can be evaluated according to the analyst’s interest. Moreover, different feature sets, or individual features, can be mapped onto the x- and y-axes of the map. As shown in Fig. 2, the selected features for each axis are reduced to one dimension each. This allows linguists to compare the distribution of features or feature groups in relation to the overall complexity of the corpus. Figure 2 illustrates this by presenting the distribution of the feature averidical-ratio relative to the full feature set. As the picture suggests, many arguments in the selected set do not indicate averidicality; those that do, however, are spread fairly evenly across the data set.Footnote 5 Linguists can select arguments of interest, such as argument clusters, outliers, or arguments with a certain feature value, for closer inspection to refine the information provided by the argument similarity map. As shown on the right of Fig. 2, researchers then see global and local feature co-occurrence matrices. As the name suggests, these visualizations present feature collocations. The global matrix (upper right corner) displays pairwise feature interactions within the selected subspace. Darker shades indicate many feature co-occurrences, while lighter shades indicate fewer. When a cell is selected, the local matrix (bottom right corner) shows in closer detail how the two features interact, using the same overall method. Thus, the local co-occurrence matrix in Fig. 2 suggests that, in this selection, many arguments contain one propositional attitude verb expressing a level of veridicality.
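As a rough illustration of how such a map could be computed, the sketch below reduces a feature matrix to two dimensions with PCA and, for the customized view, reduces a selected feature group to one dimension per axis. The data, the feature index, and the variable names are placeholders of our own, not the application's code.

```python
import numpy as np
from sklearn.decomposition import PCA

# X: one row of linguistic feature values per argument (n_args x n_features).
rng = np.random.default_rng(0)
X = rng.random((120, 66))              # stand-in for an annotated argument set
averidicality_cols = [12]              # hypothetical index of averidical-ratio

# Default map: project the full feature space onto two components.
xy = PCA(n_components=2).fit_transform(X)

# Customized map: a selected feature (group) on the x-axis, reduced to one
# dimension, plotted against the full feature set reduced to one dimension.
x = PCA(n_components=1).fit_transform(X[:, averidicality_cols])
y = PCA(n_components=1).fit_transform(X)
custom_xy = np.hstack([x, y])
```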

Overall, the argument exploration dashboard can be used to find balanced data sets for specific features and to explore and reduce imbalances in test sets. Furthermore, it provides an overview of the coverage of the corpus.

3.3 Learning Preferences via Visual Interactive Labeling

Our goal is to learn preferences from pairwise comparisons, as illustrated in Fig. 3. There, two different arguments are presented. In accordance with our definition of an argument, choosing the preferred argument means choosing the argument whose premise more strongly affects the acceptability of the conclusion (i.e., increases it for a support relation or decreases it for an attack relation). The task can be varied along several dimensions, e.g., by only presenting premises affecting the same conclusion, or only arguments with support relations. Thus, the system allows for some flexibility in the definition of comparison tasks.

Fig. 3.
figure 3

Visual interactive labeling

The annotation of argument preferences is an extremely expensive task because the number of comparisons \(n\) for a set of \(x\) arguments grows quadratically, \(n = x(x-1)/2\). Thus, a full annotation of 30 arguments already requires 435 comparisons. Because we want to test personalized argument preferences, we cannot use multiple annotators for the same model to reduce the annotation cost per annotator. Consequently, we have developed a system aimed at supporting this costly annotation process and possibly reducing the number of annotations needed to make valid predictions about a user’s argument preferences.

Learning Argument Preferences: For learning preferences, we represent arguments as linguistic feature vectors based on the annotations explained in Sect. 3.2. As an underlying model, we use a model for pairwise preference learning based on Gaussian Process Preference Learning [36, 37], a type of Bayesian inference model. These models define a real-valued function f that takes linguistic feature values as input and can be used to predict rankings, pairwise labels, and ratings for individual arguments [36]. Concretely, ratings are represented as numeric values provided by f, where higher values correspond to a stronger preference for the given argument based on its features. Pairwise labels are predicted via the preference likelihood \(p(i \succ j|f(x_i), f(x_j))\), where \(i \succ j\) is a pairwise label comparing two arguments (i.e., argument i is better than argument j).

The application does not hinge on this choice of model. However, preliminary tests have shown the model’s suitability for testing the overall pipeline. We chose it primarily for its good performance on sparse data: it can learn from relatively few comparisons, which makes learning the preferences of individual users more feasible.
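For readers who want a concrete picture of the pairwise set-up, the following is a simplified stand-in, not GPPL itself: it trains an off-the-shelf Gaussian process classifier on feature-difference vectors, whereas the actual model places a GP prior on the utility function f. All data and labels in this sketch are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.random((30, 66))                       # feature vectors of 30 arguments
pairs = [(i, j) for i in range(30) for j in range(i + 1, 30)]   # 435 pairs
labels = rng.integers(0, 2, len(pairs))        # placeholder annotations:
                                               # 1 means "argument i preferred over j"

# Classify feature *differences* x_i - x_j as a crude pairwise preference model.
diffs = np.array([X[i] - X[j] for i, j in pairs])
model = GaussianProcessClassifier(kernel=RBF()).fit(diffs, labels)

# Estimated probability that argument 0 is preferred over argument 5.
p = model.predict_proba((X[0] - X[5]).reshape(1, -1))[0, 1]
```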

Visual Interactive Labeling: The visual interactive labeling process is divided into two parts. First, a small random subset of comparisons is sampled from the data set that is to be annotated. A user annotates this subset to provide some initial comparisons for model training. Once the subset is fully annotated, the second stage begins.

In the second stage, the user is supported by information from their preference model. Figure 3 illustrates our interface for visualizing model information. On the left-hand side, the two arguments are presented side-by-side. They are compared on a 5-point scale corresponding to the positions of the arguments (i.e., A1 is the left argument, and A2 is the right argument). The visualization on the right side guides users through the annotation process. It consists of two parts, separated by the arguments (represented by their IDs), which form the spine of the visualization. On the left side, an arc diagram provides information about the overall annotation progress by showing the already annotated argument pairs in gray. Additionally, the arc diagram visualizes predictions by the user’s model: the five green arcs suggest candidates for the next comparison based on the variance the model predicts for these comparisons. These suggestions are calculated globally across all arguments by default. The pair of arguments currently displayed for comparison is highlighted in pink as the comparison most favored by the model. The user can change the next pair of arguments by selecting other green arcs or by clicking on single arguments. This becomes relevant when argument-specific information is to be included in the decision process.

As displayed on the right side, each argument is represented by a tuple of bar charts describing its number of annotated comparisons in orange, its predicted absolute preference score in red, and the model’s certainty about this score in blue. Relying on the variance alone sometimes leads to a situation where only a certain subset of arguments is annotated frequently while other arguments are not annotated at all; users may wish to strive for a more balanced annotation process, and the visualization gives them the flexibility to do so. The visualization also allows users to investigate their annotation process by showing them the predicted ranking of the arguments based on their model. Thus, in addition to the concrete display of the model’s accuracy value, users can also confirm that the model learns the expected rankings for individual arguments.
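Continuing the stand-in model from the previous sketch (and not the CUEPipe's actual selection logic), a simple approximation of these variance-based suggestions is to propose the unannotated pairs the model is least certain about.

```python
import numpy as np

def suggest_pairs(model, X, annotated, k=5):
    """Return the k unannotated argument pairs the model is least certain
    about (predicted preference probability closest to 0.5), a simple
    proxy for variance-based suggestions.

    model: any fitted classifier with predict_proba over feature differences
    X: argument feature matrix (n_args x n_features), as a numpy array
    annotated: set of (i, j) index pairs that have already been compared
    """
    n = len(X)
    candidates = [(i, j) for i in range(n) for j in range(i + 1, n)
                  if (i, j) not in annotated]
    diffs = np.array([X[i] - X[j] for i, j in candidates])
    p = model.predict_proba(diffs)[:, 1]       # P(argument i preferred over j)
    uncertainty = -np.abs(p - 0.5)             # higher means less certain
    order = np.argsort(uncertainty)[::-1]
    return [candidates[idx] for idx in order[:k]]
```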

Ultimately, this visualization serves to investigate strategies for quickly increasing model accuracy, particularly during the annotation of large data sets. Once a large number of arguments is involved, it becomes infeasible to annotate all comparisons. Thus, making the right annotations to increase model predictability is essential. As of now, we rely only on data from the trained models; however, the issue has been gaining more attention recently (e.g., [13]). Thus, future work aims at improving the model’s capability to select meaningful comparisons.

3.4 Exploring Personal Preferences

The CUEPipe provides functionality for exploring preference models based on the previous steps of the pipeline. Concretely, we provide functionality for model performance analysis and model comparison across different users.

Model Performance Analysis: Users can apply their models to arbitrary data sets, testing them on unseen data. In addition, k-fold cross-validation (with k = 5) is provided, so that models can be trained on larger data sets involving both seen and unseen data, giving a more in-depth picture of a user’s model performance. We have also added functionality for re-calculating the model training history, which lets us investigate the model’s performance in relation to the annotation progress on a given data set.
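A sketch of what such a 5-fold evaluation over pooled comparisons might look like, again using the simplified stand-in model from the earlier sketches rather than GPPL; the function and variable names are our own.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.gaussian_process import GaussianProcessClassifier

def cross_validate(diffs, labels, k=5, seed=0):
    """k-fold cross-validation over pairwise comparisons: train the
    stand-in preference model on k-1 folds, score accuracy on the rest.

    diffs: numpy array of feature-difference vectors, one per comparison
    labels: numpy array of pairwise labels (1 = first argument preferred)
    """
    scores = []
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(diffs):
        model = GaussianProcessClassifier().fit(diffs[train_idx], labels[train_idx])
        scores.append(np.mean(model.predict(diffs[test_idx]) == labels[test_idx]))
    return float(np.mean(scores))
```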

Fig. 4.
figure 4

Model exploration

Comparative Model Exploration: The main visualization is presented in Fig. 4. It allows for the exploration and comparison of user models according to their predicted preferences. Again, we use principal component analysis to project the high-dimensional feature importance vectors provided by the user models onto a two-dimensional, radial space. Hence, models that are displayed close together share similar feature importance vectors. We use feature importance as an indicator of the impact of linguistic features on the prediction of argument preferences; thus, different feature importance values indicate different argument preferences.

To illustrate these differences, we separate the space into multiple slices, displayed similarly to a pie chart. The user may determine the number of slices. The individual pieces of the pie describe potential model clusters, i.e., models with similar feature importance patterns. The feature importance vectors of these models are aggregated and visualized in the outer ring of the visualization. This provides users with information on the differences between the various clusters. Color is used to associate each user model with its respective arc and to display important feature differences in the outer ring, thus supporting the differentiation between the model clusters.
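A possible reading of this layout in code, under our own assumption that the slices are equal angular sectors around the PCA origin:

```python
import numpy as np
from sklearn.decomposition import PCA

def radial_layout(importance_vectors, n_slices=6):
    """Project per-user feature-importance vectors to 2-D with PCA and
    assign each model to one of n_slices angular sectors (pie slices)."""
    xy = PCA(n_components=2).fit_transform(np.asarray(importance_vectors))
    angles = np.arctan2(xy[:, 1], xy[:, 0]) % (2 * np.pi)
    slices = (angles / (2 * np.pi) * n_slices).astype(int)
    return xy, slices

def slice_profiles(importance_vectors, slices, n_slices=6):
    """Aggregated importance profile per slice (shown in the outer ring);
    empty slices yield None."""
    V = np.asarray(importance_vectors)
    return [V[slices == s].mean(axis=0) if np.any(slices == s) else None
            for s in range(n_slices)]
```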

The model comparison visualization allows an analyst to cluster users and find commonalities between their models. To explore the models further, it is possible to extract the top and bottom arguments from the annotated data sets (and beyond, if model performance allows it) and feed them into the previously presented argument exploration view (Sect. 3.2). There, the feature distributions of the different sets (all, top, and bottom arguments) can be inspected.
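A minimal sketch of this top/bottom extraction, together with a simple feature contrast between the two sets; this is our own illustration, and the register-style analysis in the application is richer than a plain mean difference.

```python
import numpy as np

def top_bottom(scores, k=10):
    """Indices of the k most and k least preferred arguments according
    to the model's predicted preference scores."""
    order = np.argsort(scores)
    return order[-k:], order[:k]

def feature_contrast(X, top_idx, bottom_idx):
    """Mean feature difference between top and bottom arguments, a simple
    starting point for inspecting which features separate the two sets."""
    X = np.asarray(X)
    return X[top_idx].mean(axis=0) - X[bottom_idx].mean(axis=0)
```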

4 Study: Propositional Attitudes

We conducted a proof-of-concept case study to evaluate the functionality of the system. In the study, a linguist created a data set for exploring the impact of propositional attitude verbs on argument preferences. Subsequently, users were asked to compare arguments so that their preferences could be learned. Finally, the results were analyzed using the presented model exploration functionality.

Creating Data Sets: For this proof-of-concept study, we created a data set consisting of arguments containing propositional attitude verbs, a kind of embedding verb encoding the commitment of a source to the embedded content.

More concretely, the properties we are interested in relate to the intensional nature of (some) embedding verbs [9, 26]. One important notion is factuality or factivity [22], which also receives regular attention in the computational literature (e.g., [32, 40]).

For the sake of this paper, we understand factivity as a continuous value that describes the degree of commitment attributed to the content embedded under factivity markers, e.g., discover in (3-a) or believe in (3-b). However, as (3-b) illustrates, the source that the commitment is attributed to is also relevant. Thus, the fact that 3 out of 50 lawyers believe that piracy is theft is not a strong premise for the standpoint piracy is theft (cf. (3-b)); however, were it 47 out of 50 lawyers, then despite the weaker commitment indicated by believe (compared to discover), the premise might still provide good support.

figure c

The test and training data sets were created by a linguist based on 88 arguments containing propositional attitude verbs in the corpus. The data set was skewed towards the embedding verbs claim, think, agree, and show. This is illustrated in Fig. 5 (right side). The embedding verbs used in the 15 training arguments are shown on the left of Fig. 5.

Preference Learning Experiments: We tested five participants (i.e., users) in this study. All of them had an academic background (three student assistants and two postdocs). Each participant completed two annotation sessions. In the first session, they annotated preferences over ten arguments, resulting in 45 random comparisons. These later served as test sets for the models trained on their training sets, which comprised 105 comparisons. Due to the sparseness of the data, we tested the models both on seen and unseen data. The results in Table 2 show that while the models learned the argument preferences of some users relatively well within the seen data, applying them to unseen data shows that they have not learned enough to make general predictions about the users’ argument preferences (tested by combining the training and test sets and applying k-fold validation with k = 5, as provided by the application).

Table 2. Model performance (accuracy) across users
Fig. 5.
figure 5

Embedding verbs in the training set (left) and corpus (right)

Model Exploration: We fed the models from the annotation study into the model exploration dashboard. The dashboard shows that the three users with relatively high accuracy on the seen data formed a cluster with respect to the feature importance values of their models, while User1 and User3 formed their own clusters (see Fig. 4). We can see that the models with the best accuracy metrics generally have higher feature importance scores across features. This suggests that their preference patterns are more consistent with the underlying linguistic features. However, comparing the cluster of three in isolation reveals considerable differences in the importance of argument features, suggesting that the models are still quite distinct. Concretely, for User5, positive and negative sentiment were important features for the model, but the semantic features of veridicality and averidicality (i.e., those pertaining to factivity) did not seem to play a role. Conversely, User2 put focus on neutral sentiment, and the semantic features concerned with factivity were among the most important ones of their model. Finally, the third model in this cluster was mostly affected by features pertaining to linguistic complexity, while sentiment and factivity features played only a minor role.

Overall, the compactness of the study requires us to take these observations with a grain of salt. Nonetheless, this proof of concept shows that the pipeline can be used to inspect personal argument preferences across multiple users with only a few arguments. Informally collected feedback from the five users was, on average, very positive, although minor technical issues occurred during the study. However, we leave a more detailed analysis of the system’s usability for future work.

5 Limitations

In this section, we discuss limitations that pertain to the CUEPipe itself and to our proof-of-concept study.

5.1 The CUEPipe

Currently, CUEPipe is best suited for smaller pilot studies tailored toward the initial investigations of hypotheses. This limit is imposed on a technical level as well as on the level of implementation of the visualizations. On a technical level, the limit pertains mainly to the visual interactive labeling step. In the current implementation, model updates needed for making meaningful annotation suggestions require complete retraining of the model. This works well on smaller amounts of data but can interrupt the annotation process as the models become larger, both in terms of feature annotations and the number of comparisons. To some extent, this can be solved implementationally by optimizing training procedures (e.g., by running them asynchronously in the background and adapting the interface between the annotation interface and the trained preference models accordingly). Another possibility that could be explored is to create crowd models [37] and, for example, merge models of users with similar annotation behaviors. However, this would obviously lead to larger but (potentially) fewer models. Thus, ultimately, large-scale studies based on pairwise comparisons may require a more powerful infrastructure than is currently available.Footnote 6

Another issue of the CUEPipe is that the visualizations do not always scale optimally with increasing data complexity. In particular, representing complex feature annotations can clutter visualizations. Thus, organizing and representing linguistic features intuitively is an ongoing concern. Our goal is to improve on the current state, which only allows the selection and deselection of features. A more ambitious approach would be to incorporate guidance. For example, in the radial exploration visualization, such a system could attempt to automatically detect features relevant to distinguishing target groups and highlight them.

5.2 The Proof-of-concept Study

The study focused on the system’s overall usability, concentrating on the workflow described in Sect. 3.1. As mentioned there, the study should be seen as a prototype study. The main drawbacks are as follows:

The study participants were not selected with particular demographic properties in mind. Although the system can and should be used to find differences in seemingly homogeneous groups (in this case, all participants were academics), a study geared towards predicting predefined clusters in a target group may illustrate the validity of the system more clearly.

Overall, the study is small-scale. Thus, as mentioned in Sect. 4, the results should be taken with a grain of salt. This is further compounded by the fact that we rely on automated feature annotations. While this is fine for some features, e.g., those pertaining to language complexity, the meaning-oriented features in particular, such as veridicality, need to be evaluated carefully to avoid propagating wrong information into the analysis stage of the system. For example, the system broadly captures the right generalizations regarding the relation between attitude verbs and veridicality, but there are some outliers that can distort the results. Concerning the first problem, future studies are planned with a focus on exploring the individual properties of target groups. Regarding the second problem, including evaluation metrics for the features may make the system more transparent.

6 Conclusion

We have presented an application combining three major components for researching personalized argument preferences: data collection, preference labeling, and preference exploration. We also contribute a small (but dynamic) corpus of linguistically annotated arguments and various techniques for visual analysis of linguistic data. The CUEPipe application has been demonstrated by means of a proof-of-concept study, indicating that the overall workflow is successful.

The pipeline opens up multiple avenues for future work, e.g., facilitating comparative annotation, the visual representation of linguistically annotated data, and the visual exploration of linguistic models. Overall, the CUEPipe provides exciting prospects for exploring personalized argument preferences. Its coverage of several major tasks in linguistic research makes it interesting for anyone working on argument preferences. Furthermore, its ease of use lowers the barrier for users new to the topic to conduct these tasks.