VisGIL: machine learning-based visual guidance for interactive labeling

Labeling of datasets is an essential task for supervised and semi-supervised machine learning. Model-based active learning and user-based interactive labeling are two complementary strategies for this task. We propose VisGIL which, using visual cues, guides the user in the selection of instances to label based on utility measures deduced from an active learning model. We have implemented the approach and conducted a qualitative and quantitative user study and a think-aloud test. The studies reveal that guidance by visual cues improves the trained model’s accuracy, reduces the time needed to label the dataset, and increases users’ confidence while selecting instances. Furthermore, we gained insights regarding how guidance impacts user behavior and how the individual visual cues contribute to user guidance. A video of the approach is available: https://ml-and-vis.org/visgil/.


Introduction
Machine learning (ML) has been successfully applied in a variety of domains in recent years [5,23,47,49,61,79,87]. While the increase in model accuracies is widely viewed as promising, there is a bottleneck for such applications: many of these accurate models rely on supervised learning. Hence, a large amount of labeled data is required to train these models. The acquisition of data has become easier in recent years through advances in research and technology such as enterprise databases, web crawlers, pervasive computing, and the Internet of Things. The provision of the corresponding labels, however, is still a major challenge. This often prevents the application of ML in productive environments where labels are not available. This problem has, for example, been stated in [2,42,67,74].
To label the data, expert knowledge is needed. The labeling process is therefore associated with a considerable amount of time and cost, as the experts have to carefully examine each instance. Otherwise, the data can become useless for training or result in faulty models.
Correspondence: Benedikt Grimmeisen, benedikt.grimmeisen@hs-aalen.

Active learning (AL) [56] is an approach to reduce the number of instances that need to be manually labeled. AL uses different strategies to query the user to label instances that are considered informative. While in AL, users are constrained to label suggested instances, a complementary approach is to use interactive visualizations, where instances can be freely labeled by the user leveraging methods of Visual Analytics (VA). Bernard et al. [7] compared AL and user strategies and derived the strategies users followed when selecting instances to label. In [7], it was shown that pure user-driven selection can lead to a biased and suboptimal selection. With visual-interactive labeling (VIAL), Bernard et al. [10] proposed an approach combining the complementary strengths of user-based selection and model-based suggestions to label the best possible candidates. Our work builds on the aforementioned research, specifically on [10]. While VIAL and the approaches implementing it focus on the interplay of users and active learning models, in this paper we illuminate interactive labeling from the perspective of guidance [14]. We propose the approach VisGIL, which guides the user in the selection of instances based on utility measures deduced from the active learning model.
We address two research questions:
RQ1: How can users be guided toward instances with higher impact on the model when a selection of instances is available?
RQ2: How does guiding users in the selection process affect their selection and labeling strategy as well as the resulting classifier?
To achieve a higher accuracy for ML models, we present Visual Guidance for Interactive Labeling (VisGIL), a VA approach which calculates the utility of each instance for training a classifier and presents it together with recommendations of particularly useful instances within an interactive user interface (see VisGIL in Fig. 1). The user is thereby guided by the model by means of visual cues. However, the user can freely decide to either follow the guidance or to choose other instances to label. Iteratively, an ML model is trained on the subset of user-labeled data points and propagates the labels by classifying the unlabeled data points. The user may decide to refine the labeling or to confirm the manual labels together with the label propagation conducted by the ML model.
This paper extends the work presented in [28] with an emphasis on further evaluation of the proposed approach. To this end, the scope of the original user study was extended by a quantitative analysis using statistical methods. Moreover, a think-aloud test with different participants and a more complex dataset was conducted and evaluated.
This work makes the following contributions:

Active learning
In Active Learning (AL) [56,57,70], an iterative training process is used. The active learner queries a human oracle for labels of informative data points. Informative instances are those that will most likely improve the model's accuracy. By selecting the most effective instances for training, the aim is to minimize the number of user interactions. Strategies for the selection of instances can be grouped into different categories depending on the metrics and sources [7,9,26,83]. In general, these strategies can either use information from the model or the data.
Uncertainty sampling is the most common model-based strategy. It selects the instances for which the model is most uncertain about the class. Those are usually located at decision boundaries [80]. Query-by-committee [60] extends this concept by considering an ensemble of several classifiers and their disagreement on individual instances.
Relevance-based selection [72] strategies focus on instances that are particularly relevant to a class. Strategies selecting instances that are likely to change the trained classifier the most are referred to as error reduction schemes. These aim to reduce the training error of models [59], for example.
In contrast to model-based strategies, data-based strategies only use information from the available data to select instances. For example, in the case of multi-label classification, the correlation of individual features and labels is used [83]. Recently, approaches were presented that combine model- and data-based strategies to avoid redundancies and outliers in the selection process [26] by considering such correlations. Moreover, there exist strategies focusing on the selection of diverse instances (diversity-based sampling [30,80,83]) or areas with high density (density-based sampling [58]). He et al. [30] not only take uncertainty and representativeness into account but also the diversity of the data. In an experimental comparison on several evaluation datasets, this approach exceeds the performance of previous approaches that rely on one or two criteria for the selection of instances.
Since in AL the instances to be labeled are selected solely by the model, users have no influence on this selection. Hence, available expert knowledge as well as recognized patterns, anomalies, or clusters are not considered. While AL approaches may produce good models in a fast manner, according to [13] users often feel frustrated having to answer long sequences of queries. In [3], the authors state that users are not satisfied to act as an oracle solely giving simple labels, but rather want to be involved in the interaction and aim to understand the model better. Unlike pure active learning methods, VisGIL combines the approach of active learning with interactive labeling to let the domain expert influence the data selection.
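The pool-based AL loop described in this section can be sketched as follows. This is a generic illustration using scikit-learn with a simulated oracle (the held-back ground-truth labels), not the implementation behind any of the cited approaches; the least-confidence criterion stands in for the various query strategies discussed above.

```python
# Minimal pool-based active learning with uncertainty sampling.
# The "oracle" is simulated by revealing held-back ground-truth labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
# seed set: one labeled instance per class
labeled = [int(np.flatnonzero(y == c)[0]) for c in np.unique(y)]
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                              # 20 query iterations
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # least confidence: query the instance whose top class probability is lowest
    query = pool[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)                        # oracle provides y[query]
    pool.remove(query)

print(f"labeled {len(labeled)} of {len(X)} instances")
```

In VisGIL, the purely model-driven query in such a loop is replaced by visual guidance that leaves the final choice of instances to the user.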

Interactive visualization
While in AL the model controls the workflow, visualization methods can be used to let the user freely select and label potentially informative instances. VA enhances the cognitive abilities of humans by methodically preparing and presenting data in interfaces with interactive visualization [25]. Thereby, people can gain deeper insights into heterogeneous and complex data through explorative knowledge discovery [25]. VA focuses especially on the visual presentation of these data using a variety of projection methods, plots, and glyphs, as this is the most effective way to convey information to users and to help build their understanding.
Related to this work, user interfaces have been used for the user-based selection of instances for labeling and training. The information provided by the visualization can be used for a meaningful selection of instances. This user-based selection of instances is the counterpart to the model-based selection of instances within AL.
Like the querying strategies in AL, users also follow certain strategies when selecting interesting instances. Bernard et al. [7] compared AL and user strategies in a study. Based on the observations of the subjects, 10 strategies were identified which users followed to select instances for labeling. In [9], formal building blocks were derived from these strategies, systematically describing recurring thinking patterns and concepts. The resulting user strategies can be categorized into data- and model-based strategies, similar to AL. In the data-based approaches, only information from the dataset, the instances, and their projections is used for the decision. Users focus on the distribution of labels across the entire dataset (Equal Spread), point clouds with high density (Dense Areas First), or on prominent instances such as centers or edges of visible clusters and outliers (Centroids First, Cluster Borders First & Outliers First). Also, some users try to find and label the instances that best represent the class and seem perfect to them (Ideal Labels First). For the model-based strategies, the class labels predicted by a model are considered in the decision in addition to the data-based information. Users focus on eliminating overlaps (Class Borders Refinement & Class Intersection Minimization) or large distributions (Class Distribution Minimization) of classes in the representation. Some users focus on instances that are far away from the perceived center of the class (Class Outlier Labeling).
VisGIL extends the interactive linked views for interactive labeling [10]. Namely, a label distribution view is added beside the Instance Overview & Selection (see Fig. 1). In particular, our work enhances previous work by explicitly considering and evaluating user guidance.

User guidance
It has been experimentally shown in [7] that purely user-driven selection and labeling of instances can lead to a biased and sub-optimally labeled dataset. While leaving the user in control, supportive guidance appears to be beneficial. The term "guidance" is a broad concept that leaves much room for interpretation. Derived from various interpretations from different areas such as human-computer interaction or visualization, Ceneda et al. [14] describe guidance in terms of VA as "[…] a computer-assisted process that aims to actively resolve a knowledge gap encountered by users during an interactive VA session." Guidance aims at supporting the user in fulfilling a task through a dynamic process. The task can be split up into a series of actions that eventually lead to the completion of the task. Guidance supports the user in situations where they are unable to identify or execute at least one of these actions. The process becomes dynamic as the target result is unknown. In this case, subsequent tasks must be derived from prior actions. Guidance is not supposed to explain the current state of the visualization or how it was created. Through forward-directed assistance, the users are supposed to develop this sense by themselves. While guidance can also be provided by humans toward machines, in the context of this definition and the rest of this work, guidance refers to guidance provided by machines to humans.
In their guidance model, Ceneda et al. [14] characterize user guidance through a knowledge gap, input and output, and guidance degree. Three guidance degrees are distinguished: (1) Orienting is the lowest degree and supports users in building a mental map with hints on possible targets or paths toward the solution. For this purpose, visual cues in visualizations are often used. (2) Directing emphasizes a possible course toward a solution. Alternative options that lead to the desired result are offered, and users can accept or ignore these suggestions. (3) Prescribing is the highest degree of guidance. Decisions on the next steps are made in an automated process leading toward the desired result. These three degrees are dynamic, allowing to switch between them as necessary.
Building on the theoretical guidance model presented by Ceneda et al. [14], Collins et al. [18] proposed several practical conclusions for a promising implementation of user guidance for human-machine processes in VA. In addition to conceptual goals for good guidance, the approach describes complementary requirements for intelligent guidance. This allows the application not only to trivial visualization tasks, but also to more sophisticated model generation tasks in the field of ML. To operationalize the described guidance, the authors specify concrete building blocks for the implementation and opportunities for the evaluation.

Related work: visual interactive labeling
Based on the analysis of previous approaches, Bernard et al. [10] proposed visual interactive labeling (VIAL), a unified process for the interactive labeling of instances with the user in the loop. VIAL aims to combine the mutual strengths of AL and interactive visualizations. In the iterative approach, there are two complementary alternatives for identifying instances for labeling: Interactive Visualization and Candidate Suggestion. Whereas the users are queried by the model with automated suggestions in the AL perspective, they take a more active role in the interactive visualization perspective by exploring and selecting instances. The VIAL process proposes to include either AL-based guidance concepts in visual interfaces, or visual-interactive interfaces for the analysis and steering of AL strategies [10]. In their work, Bernard et al. [10] describe the six steps of the process as well as challenges and inspiring examples for both perspectives for possible implementations of the suggested process. VisGIL is based on the VIAL framework and adds visual cues and guidance to better steer the user's decision for data selection.
The user-based AL approach presented by Seifert and Granitzer [55] enables users to select instances for labeling using an interactive visualization that plots the a posteriori output probabilities of the classifier in a star-shaped graph. The selection of instances is solely based on the user's decision, which can be informed by the current confidence of the model.
Kucher et al. [39] implement the VIAL process with ALVA for stance classification of social media texts. An AL approach is used for the model-based selection of instances, which are sequentially presented to domain experts for labeling. Users can either label the proposed instance or ignore it. The result of the labeling process as well as the development of the model is displayed after completion of the labeling sequences via an interactive visualization where the results can be analyzed with different views and filters.
Han et al. [29] implemented the VIAL process for the visual interactive labeling of video snippets of a news broadcast including the subtitles as metadata. They use an AL model to classify the subtitle content by subject, as manual labeling would take too long. Users can select reports to label via an interface. In addition to the predicted class, the visualization displays a confidence level for each instance, allowing users to evaluate the usefulness of individual instances similar to uncertainty sampling in AL.
The mVis [15] approach extends the VIAL process in a way that users can use linked interactive visualizations for the selection of instances for labeling, in which not only a spatial distribution of the data but also the predicted or user-assigned classes are displayed. To guide the user in the selection, various parameterizable ML methods like clustering, classification, and AL are used. While all three methods can be toggled at any time, the AL method, if active, periodically provides labels for selected instances. With all methods, unlabeled instances are assigned to new or existing classes, with the user accepting or rejecting the assignments.

Ritter et al. [50] introduce a visual-interactive approach for the classification of music genres. In addition to a list view containing instances that the model has proposed using an AL strategy (uncertainty sampling), all instances are shown in a scatterplot where users can select instances for labeling. To point users without expert knowledge to the most relevant instances for the model, an AL model is used for guidance. Instances with low relevance are shown more transparently and less emphasized in the visualization.
Luo et al. [42] propose an approach for the segmentation of images together with the labeling of these segments. They use Fisher rules and explicitly address the challenge of input data with limited quality.
Ali et al. [2] address the labeling of univariate and multivariate time series. They propose an interactive visual approach employing a connected scatter plot of the data projected to two dimensions using either PCA, t-SNE, or UMAP. For multivariate time series, they apply a deep convolutional autoencoder prior to the projection.
Motivated by the study of animals in their natural environment, Walker et al. [75] propose an interactive approach to support the interactive classification of time series.
Fan et al. [22] present an approach for the detection of network anomalies through smart labeling. Using an interface with several linked interactive visualizations, users can select meaningful candidates for labeling. Users are supported in their selection by an AL model that uses a strategy similar to density-based sampling to calculate the influence of each instance. The calculated influence is mapped to the radius of the representation in the visualization.
Bernard et al. [8] introduced a system to assess patient well-being based on multivariate time series. A classifier is trained with an AL approach. Visualizations suggest useful instances to an analyst, who can then label them. Analysts can only control the instance selection method, but cannot select instances directly.
To train an ad-hoc classifier for the labeling of video sequences via AL, the approach of Höferlin et al. [32] supports users in recognizing inconsistencies between their mental model and the classifier by visualizations and allows direct adjustments of the latter. This approach represents a hybrid, as instances are either selected manually by a user via an interactive visualization or automatically by a model.
Heimerl et al. [31] present three methods to visually train classifiers for text document retrieval, one of which uses a visual-interactive approach in combination with AL. The system consists of an interface containing six different visualizations and views. Besides a view with the projection of the data, which additionally shows the state of the model via color-coding of predicted labels, an additional view with uncertain instances is available to the user. This view shows instances that are close to the decision boundary of the model. The selection of instances for labeling is solely done by the user, who can use any of the six visualizations for the selection. There is no automatic selection of instances from the model that goes beyond the visualization of the most uncertain instances.
Jónsson et al. [38] introduce Exquisitor, a scalable interactive learning approach in which relevant candidates of new exploratory categories are suggested to users based on clusters. These candidates are evaluated by users for relevance at each iteration and then used to train a classification model, which in turn provides new, improved suggestions. This approach guides users by presenting them with specific candidates whose relevance is to be judged.
In order to optimize the interactive labeling process of large-scale multimedia datasets, Zahálka et al. [84] present Blackthorn, an interactive multimodal learning framework. The learning approach is based on the users' relevance feedback on the instances returned by the model, while meeting both performance and relevance requirements of the returned information to support the users in gaining knowledge.
Xian et al. [81] present a visual analysis tool that utilizes projection-based hierarchical visualization to enable the correction of mislabeled training data. Using auxiliary information of selected instances such as comparison with trusted items, the quality of large and noisy datasets can be explored and improved in an interactive way. During the process, domain experts provide a small set of trusted instances through manual exploration and labeling, which is intended to detect and correct possible label errors using a label propagation algorithm.
With II-20, Zahálka et al. [85] propose an approach for interactive and visual categorization of image collections. A novel model is used that aims at overcoming the static characteristics of traditional models and thus coming closer to human behavior in categorization. This model is used for the generation of relevant suggestions as well as for the automatic categorization. An innovative Tetris metaphor is introduced for the interaction with the visual categorization, synergizing with the model.
In [6], Beil et al. propose the cluster-clean-label approach for interactive labeling of instances, combining density-based clustering with interactive filtering of the identified clusters. The main idea is to cluster the data according to the expected classes and incrementally remove data points. This is achieved by training autoencoders for each of the identified clusters and presenting the instances with their corresponding reconstruction error. In an interplay of model and users, the clusters can be iteratively cleaned and, in a final step, labeled.
In [66], Theissler et al. discuss the challenges of visual interactive labeling for supervised anomaly detection. The approach starts in an unsupervised setting and moves toward a supervised setting in a stepwise manner, based on the labeled data and a sequence of ML models. The authors state that the massive class imbalance poses an additional challenge. In addition, the concept of "manual overfitting" is discussed, i.e., the threat that users may label toward easiest class separability rather than solely using expert knowledge.

Discussion and research gap
The main goal of the discussed approaches is efficient collaboration of models and users: Using few interactions, a large part of the data should be labeled.
Some approaches focus on AL, e.g., ALVA [39] uses a visual-interactive interface to analyze the AL procedure. Contrary to this, Ritter et al. [50] and Fan et al. [22] use this type of interface in their VIAL-based approaches to convey AL-based guidance concepts to the users who select instances for labeling.
Also, there are approaches implementing the VIAL process [10], such as that of Kucher et al. [39], which uses ML/AL methods to select instances and visualization to analyze the results. Although it is possible to ignore or filter suggestions and control selection parameters, the user has little to no influence on the immediate selection of instances.
Similarly, in the approach of Jónsson et al. [38], users can assess the relevance of candidates suggested by a model, yet a completely individual, exploratory selection of instances is not supported. In terms of the aforementioned degrees of user guidance [14], this corresponds to directing or prescribing.
In contrast, the approaches of Ritter et al. [50] and Fan et al. [22] implement the complementary facet of VIAL with a user-based selection of instances via an interactive visualization that uses different visual cues to mediate the AL model. The users create their own mental model, compare it with the visualized state of the AL model, and select appropriate instances fully autonomously. Related to the degree of guidance, this corresponds to orienting. Besides, there are hybrid approaches [15,50] that allow model-based suggestions of instances as well as individual selection of instances simultaneously. This is implemented by two individual, parallel visualizations in the interfaces.
Although extensive research has been done on various datasets to combine AL and interactive labeling, none of the approaches were implemented with the concept of guidance according to [14] in such a way that both the user-driven selection of instances and the model-based suggestions are unified in one visualization with visual cues. Such an approach would, however, be beneficial, since users would not have to decide between guidance by orienting or prescribing but rather have the combined spectrum with the guidance degree directing at their disposal. Therefore, VisGIL is proposed to further improve current VIAL implementations.

The approach: visual guidance for interactive labeling (VisGIL)
We present the hybrid approach VisGIL, which combines user- and model-based selection for interactive instance labeling (see Fig. 3). VisGIL is designed for multivariate datasets. Multivariate datasets can be shown as a table, in which rows are records and columns are attributes.
Starting with unlabeled data, users select and label instances. The user is guided from the very beginning of the process by suggesting useful instances and showing them with visual cues. By calculating suggestions in the first iteration using the representativeness of the unlabeled instances instead of relying solely on an untrained model, we overcome the cold start problem known from AL. An interplay between the user and a classifier takes place to collaboratively label the entire dataset with a low number of user interactions. Therefore, the classifier is iteratively trained on the current set of manually labeled data points to propagate the labels by tentatively assigning predictions to unlabeled data.
The approach implements the VIAL process [10] with the two complementary components Candidate Suggestion and Result Visualization to identify candidates to be labeled. The process consists of three steps, which are reflected as areas in the user interface of VisGIL (see Fig. 1).
In the first step, similar to Norman's Stages of Action [44], the user evaluates the data and the state of the ML model by comparing it with their mental model and initial goals. To do so, users can use the Instance Overview & Selection to get an overall impression of the data and label distribution as well as the ML model status. As shown in Fig. 1, the user is provided with the calculated utility of each instance along with recommended instances. The information provided in the Support Visualizations area provides additional details about this distribution and the model in order to validate the information against the mental model. This also seeks to prevent the overlooking of classes as well as the creation of a class imbalance. Especially at the beginning of the process, when the user has assigned only a few or no labels and the model is therefore insufficiently trained, the calculated utility and the suggestions are essential for the user to identify instances for further inspection and eventually labeling (Fig. 2a, b). Based on this, users select the instances to label (Fig. 2c) by clicking on them individually or by lassoing them in the Instance Overview & Selection.
In the second step, after the initial selection, users are provided with additional details for the selected instances in the Detail View & Filtering view. They can use this information to compare instances and refine the selection by filtering out irrelevant or erroneously selected instances by simply clicking on the respective list item.
In the third step, the user assigns the corresponding labels to the selected instances in the Instance Labeling area of the interface. The user has the option either to assign specific class labels or to accept the labels suggested by the model, which will be especially useful in an advanced state of the model later in the process.
Furthermore, the labeled instances are added to the training set and the model is retrained. Simultaneously, the model calculates new recommendations. After recalculating the utility of the instances and the recommendations, users should be made aware of new areas of interest that they may not have previously perceived as such (Fig. 2d).
This is an iterative process, starting with the first step displaying the updated model and the new recommendations. The user repeats this procedure until satisfied with the model's performance. Then the model's predicted classes are accepted as labels for the remaining instances that have not been manually labeled. As a result, not only a labeled dataset and a trained model are obtained, but also insights about the data itself from exploring the visualization.
In the following, we describe both main components Candidate Suggestion and Result Visualization (see Fig. 3) of the approach in detail. While the former is model-based, the latter is an interactive visualization showing the data, the model state, and the visual cues to guide the user.

Candidate suggestion
The Candidate Suggestion calculates the potential benefit of each unlabeled instance and, based on this, automatically selects meaningful instances that are suggested to the user. For this purpose, as with conventional AL, query strategies are used. We denote the estimated benefit of an instance as utility, inspired by prior research [26]. To calculate the utility, we use model- and data-based information, following Fu et al. [26] and He et al. [30], who consider not only the uncertainty of the model but also the representativeness of the instance when calculating the utility. The use of representativeness compensates for the fact that the uncertainty of the model considers the instances in isolation from each other. Thus, potential outliers that are not representative of the dataset might get recommended. The representativeness prevents this by incorporating similarities between the data points.
Inspired by [26,30], we define the utility of an unlabeled instance x_i as the weighted product of the uncertainty of the model and the representativeness of the instance. For each instance, this weighted product is calculated. In our case, we chose to weight uncertainty and representativeness equally. For different problem settings or datasets, the trade-off between uncertainty and representativeness can be adjusted as needed. Leveraging the entire available dataset for utility computation, we achieve a very high degree of coverage [18] in terms of instances that are incorporated into the suggestions.

Uncertainty calculation
We use a probabilistic model for the calculation of the uncertainty, which represents the predicted result of a classifier as a vector containing the posterior probabilities of the classes learned by the model for each instance. Following the three user strategies from [9] (class borders refinement, class outlier labeling, and class intersection minimization), we use the margin approach to minimize the error rate in classification. This approach has been criticized for considering only the two most probable classes and neglecting additional information when calculating the uncertainty. However, similar to the user strategies [9], our user-based approach seeks to minimize classification errors by distinguishing clearly between specific classes, which is the key feature of the margin approach.

Fig. 3 Overview of the VisGIL approach based on the VIAL process [10]. Our iterative approach includes two complementary main components: algorithmic models for Candidate Suggestion (green) and visual interfaces for Result Visualization (orange)

The originally proposed metric [26] is adapted in the following way to obtain the difference between the posterior probabilities of the two most probable classes for each unlabeled instance x_i:

Margin(x_i) = P(ŷ1 | x_i) - P(ŷ2 | x_i)  (1)

where ŷ1 and ŷ2 are the most and second most probable class. With the traditional least margin approach, the model tries to separate the two classes with the highest probability as distinctly from each other as possible, which is indicated by the largest possible margin between the probabilities. Instances with a small margin are therefore particularly informative for the model, since in these cases the model is unsure. The complement of the margin (Eq. (1)) is used as the measure of uncertainty:

Uncertainty(x_i) = 1 - Margin(x_i)  (2)

By this conversion, a small margin expresses a high degree of uncertainty, since in that case confusion can easily occur between the two most probable classes. To determine the uncertainty, a trained model is necessary. However, initially no labeled instances are available to train a model. Therefore, we start by assigning the maximum uncertainty of 1 to each instance, allowing us to calculate an initial utility and thereby to recommend instances.
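Given a classifier's per-instance class probabilities, the least-margin uncertainty described above can be sketched in a few lines. The function name and the toy probability matrix are illustrative, not part of the original implementation:

```python
import numpy as np

def margin_uncertainty(proba):
    """Complement of the least margin: 1 - (P(y1_hat|x) - P(y2_hat|x))
    per instance, where y1_hat and y2_hat are the two most probable classes."""
    # Sort class probabilities in descending order along the class axis.
    sorted_p = np.sort(proba, axis=1)[:, ::-1]
    margin = sorted_p[:, 0] - sorted_p[:, 1]   # difference of top two classes
    return 1.0 - margin                        # small margin -> high uncertainty

# Toy posterior probabilities for two instances and three classes:
proba = np.array([[0.5, 0.3, 0.2],     # small margin -> high uncertainty
                  [0.9, 0.05, 0.05]])  # large margin -> low uncertainty
uncertainty = margin_uncertainty(proba)
```

Before the first training step, where no probabilities exist yet, the uncertainty array would simply be filled with the maximum value of 1.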

Representativeness calculation
To calculate how representative an instance x_i is, our approach uses the information density I. The higher the information density of an instance, the more similar the instance is to the rest of the data, expressed by:

I(x_i) = (1/U) Σ_{u=1}^{U} sim(x_i, x_u)  (3)

where U denotes the size of the unlabeled dataset and x_u its instances. As the similarity measure, we use the Euclidean similarity:

sim(x_i, x_u) = 1 / (1 + sqrt(Σ_{j=1}^{d} (x_{i,j} - x_{u,j})²))  (4)

where d refers to the number of attributes. Calculating the information density using Euclidean similarity prefers the centers of clusters. We assume that these clusters correspond to actual classes in the data, with the identified centers being the most representative instances of these classes. Adapting the data-based user strategy Ideal Labels First [9], which focuses on the most characteristic instances of a class, these archetypes can be generalized and applied to the rest of the cluster's instances.
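As a minimal sketch, the information density with a 1/(1 + Euclidean distance) similarity (our reading of the similarity measure above; the exact form is an assumption) can be computed as:

```python
import numpy as np

def information_density(X):
    """Mean similarity of each instance to all unlabeled instances,
    using sim = 1 / (1 + Euclidean distance) as the similarity measure.
    Instances near cluster centers receive the highest density."""
    # Pairwise Euclidean distances between all instances.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (1.0 / (1.0 + dists)).mean(axis=1)

X = np.array([[0.0], [1.0], [2.0]])
density = information_density(X)   # the middle instance scores highest
```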

Utility calculation
Incorporating Eqs. (2) and (3) into one common equation yields the utility of an unlabeled instance x_i:

Utility(x_i) = Uncertainty(x_i) · I(x_i)  (5)

Diversity calculation
Based on the calculated Utility(x_i), particularly useful instances for training are determined and suggested to the user. A naive approach would be to suggest the top n instances with the highest utility. However, since these can be located very close to each other, superfluous instances may be labeled. To distribute the recommendations meaningfully over the entire dataset, we use kernel k-means clustering [20] as proposed by [30]. Through this mechanism, we aim to direct the users' attention to novel regions [18] in the dataset so that they can discover new analysis directions and insights within those regions. However, we do not use all unlabeled instances for clustering, but rather a high-utility subset. This way, instances with little to no impact on the training are sorted out in advance. We choose the instances closest to each of the k calculated cluster centroids. Thus, k instances are recommended which are both useful and diverse. We set the number of clusters equal to the number of classes, providing the user with a diverse range of different suggestions [18] after each training step.
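The diversity step can be sketched as follows; plain k-means stands in for the kernel k-means used in the paper, and the function name and `subset_size` cut-off are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def diverse_suggestions(X, utility, n_classes, subset_size=200, seed=0):
    """Cluster a high-utility subset and return, per cluster, the
    instance closest to the centroid: k useful and diverse picks."""
    top = np.argsort(utility)[::-1][:subset_size]  # high-utility subset
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(X[top])
    # Subset instance nearest to each of the k centroids.
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X[top])
    return top[closest]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
utility = rng.random(len(X))
picks = diverse_suggestions(X, utility, n_classes=3)  # 3 spread-out instances
```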

Learning model
We use a random forest classifier (RFC) [12] for the prediction of class labels. An advantage of RFCs is that they directly yield the probability of each class for each instance, which is necessary for the uncertainty calculation. Besides, the classifier is fast in training and prediction, which is important for seamless user interaction and allows the approach to scale well for higher-dimensional data. Respective results must be provided to users as quickly as possible; otherwise, they will lose interest and attention due to long waiting times. The RFC delivers good accuracy reliably and easily. The standard configuration of scikit-learn [45] was used in our implementation (100 trees, Gini impurity, no restrictions on the depth of the trees). Besides determining the uncertainty for the utility calculation, the classifier is used for the automated labeling of all unlabeled instances. These labels are tentative; users can confirm them manually. During the training process, predicted labels may change as the classifier is retrained.
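With scikit-learn's standard configuration, the classifier fits into a few lines; the toy data below is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 4))
y_labeled = np.array([0, 1] * 10)          # toy labels for two classes
X_unlabeled = rng.normal(size=(5, 4))

# scikit-learn defaults as in the prototype: 100 trees, Gini impurity,
# unrestricted tree depth.
clf = RandomForestClassifier(n_estimators=100, criterion="gini",
                             max_depth=None, random_state=0)
clf.fit(X_labeled, y_labeled)

proba = clf.predict_proba(X_unlabeled)  # feeds the uncertainty calculation
tentative = clf.predict(X_unlabeled)    # tentative labels, user-confirmable
```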

Result visualization
The second component (see Fig. 3) focuses on conveying the data as well as information about the model to the user. We use visualization since it enhances the cognitive abilities of humans by methodically preparing and presenting data in interfaces with interactive visualization [25]. Thereby, the user can gain deeper insights into heterogeneous and complex data through explorative knowledge discovery [25].

Data visualization
When displaying the instances to be labeled, the high dimensionality of most datasets can cause difficulties in the visualization. Although there are forms of representation such as parallel coordinates [35] or radar plots [51] which can represent many dimensions, they can become cluttered when there is too much data, or they cannot reveal the relationships and clusters in the data. Hence, projection methods are used to reduce the dimensionality and display the data with a 2D scatter plot. Sedlmair et al. [54] showed how a scatter plot of projected data can clearly display clusters in the data.
For the projection of the data, our visualization approach uses t-distributed stochastic neighbor embedding (t-SNE). This algorithm is widely used in ML and visual interactive labeling due to its "ability to create compelling two-dimensional maps from data with hundreds or even thousands of dimensions" [76]. The method is occasionally criticized for creating visualizations that lead to wrong conclusions, as they distort the dataset or present nonexistent clusters. However, such scenarios can be avoided by correctly configuring the main hyperparameters [34,76]. The methodical configuration of these hyperparameters was not part of this work and remains an open challenge for future work. For our prototype, we experimentally set perplexity = 20 and maximum iterations = 5000, and used the defaults of scikit-learn [45] for the remaining parameters. This visualization allows users to apply data-based strategies [9] to identify interesting instances based on the location and relation of the instances.
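A corresponding projection call might look like this; random data stands in for the 42-dimensional descriptors:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(100, 42))  # e.g. 42-dim descriptors

# perplexity = 20 as in the prototype; the prototype additionally raised
# the iteration budget to 5000 (keyword `n_iter` in older scikit-learn,
# `max_iter` since 1.5); remaining parameters at their defaults.
embedding = TSNE(n_components=2, perplexity=20,
                 random_state=0).fit_transform(X)
```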

Model visualization
To incorporate model-based strategies, the visualization is extended with information from the model. Regarding recognized model-based strategies for instance selection, users always fall back on the model's predicted labels to evaluate the current state of the model. Thus, we extend the visualization so that for each data point the predicted or manually assigned class label is indicated by color. Each class of a dataset is thus assigned a unique color. This allows users to immediately recognize which class each instance is assigned to.
Bernard et al. [10] raised the concern of whether predicted labels should be displayed, as they might bias users. We argue that this information is crucial for users to evaluate the current state of the model. Furthermore, Bernard et al. relate this criticism primarily to collaborative scenarios where the labels of one analyst might influence the decisions of another analyst, while our approach focuses on a single-user setup.
To distinguish labeled from unlabeled data points, they are shown with different icons in the scatter plot, revealing the coverage of labeled data points. Moreover, this provides users with a visual indication of their impact on the system, since not only the icons of the instances manually labeled by the users change but also the color, size, and icons of the other instances. Visualizing the model's improvement achieved by the users should increase their perceived relevance in the process and motivate them [43]. In addition to shape and color, a further visual cue is offered by encoding the size of the icons w.r.t. the calculated utility of each instance (see Eq. (5)). The higher the calculated utility of an instance, the larger its diameter.

User guidance
The approach supports users through the first two degrees of guidance [14], i.e., orienting and directing, in the selection of instances. With the help of visualization, users can compare their mental models, containing ideas about an ideal model, with the current state of the model. They can evaluate the current model and derive steps to label and add instances to the training set to improve the model. To guide users in identifying useful instances, the measures introduced in the Candidate Suggestion step are used (see Sect. 4.1).
To involve the model in the instance selection, the visualization is used for the suggestion of instances. This allows users to consider them when deciding which instances to label. A highly distinct icon of increased size is used with a signal color not used for any other class. Thus, the model's recommendations clearly stand out and can be easily noticed by the users (see Fig. 2a). Inspired by findings on the perception of interactivity, we exploit visual cues such as shape, color, and size to direct users' attention to specific areas in the visualization (see Fig. 2b). Especially the star icon, as an "abstract icon" [11], is predestined to convey the information that the respective instance is informative. Many modern interfaces use the star icon as a symbol for favorites or featured items.
Visual cues and how they evolve after repeated training steps can be used to guide the user's attention to specific areas in the visualization (see Fig. 2d). By changing the visual cues along with their underlying context, the users can compare the current state as well as the development of the model with their internal mental model. Using any mismatches existing between the two models, along with the visual cues themselves, possible actions can be derived. Consequently, the ML model and the human are both involved in the decision-making process about which instances are interesting or useful and therefore should be labeled next (see Fig. 2c).
Adapting the goals for guidance formulated by Collins et al. [18], we aim to support the users with useful starting points in an unknown dataset. Moreover, by visualizing the model, data, and suggestions, it is our intention to make users aware of possible biases without harming the reasoning process along the way. Consequently, instance selection, the information in the visualizations, and the displayed regions are solely assisted by the model, but never prescribed.

Interactive selection
The presented interactive user interface allows users to interact with the visualization, the data, and thus indirectly with the model. Besides simple interactions following the information-seeking mantra [62], like changing the displayed area of the visualization by moving, zooming, or filtering, users can also perform semantic interactions. These include the selection of instances and the assignment of labels. Users can select and label several instances with one interaction to make the process more efficient. The selection of instances to be labeled is done exclusively by the user, guided by the discussed visual cues. The instances recommended by the model are intended to support the user in their selection. In contrast to pure AL, the user is free to follow or ignore the suggestions, thereby taking a more active role.

Design and implementation
We implemented a prototype to evaluate the effectiveness of our approach. The logic behind the utility calculation, recommendations, and projection was implemented using scikit-learn [45] and modAL [19] in Python. For the user interface, a JavaScript-based web application was implemented, using Plotly [46] for interactive visualizations. The user interface consists of four main components (Fig. 1):

Instance overview and selection
The Instance Overview & Selection area is the central element of the workspace (see center of Fig. 1). It shows the projected instances as points in the 2D scatter plot with the corresponding visual mappings for status, class, and utility. An alternative for the case of image data would be to show the images as thumbnails in the scatter plot, instead of data points, as done in [63]. This would be beneficial for small images but might not transfer to larger images or other data types.
A legend shows the color assigned to each class, which is used consistently throughout the tool to identify that class. Labeled instances are represented by an outlined diamond with a dot, to distinguish them easily from unlabeled circles. Suggested instances, indicated by red stars, visibly stand out from surrounding instances. Using the toolbar, users can move the displayed area and zoom in. By default, the lasso tool is active, allowing users to select interesting instances by clicking and circling them. In this view, users can select instances from potentially interesting areas for review and labeling.

Detail view and filtering
All instances selected in Instance Overview & Selection are displayed in descending order of their utility in the Detail View & Filtering (see bottom right in Fig. 1). The utility (see Eq. (5)) of each instance is shown by a utility bar so that it can be better perceived and compared than a plain number. In this list view, additional details about each selected instance are shown along with the manually assigned or predicted label. The corresponding symbol of each instance is displayed so that instances in the list view can be associated with data points in the visualization. Moreover, the instance that the user currently hovers over in the list is highlighted in the visualization. Along with the generic symbols, users are also able to leverage a compact visual representation of the instances' features to gather detailed information. For image datasets, users can perceive properties and features of the individual instances very easily, as their features are color and/or brightness values of the individual pixels as well as their location in the image. These characteristics render the original image the most convenient way to represent such instances comprehensibly. Therefore, when working with image data, small preview images are included in the list view as a compact representation of the instances' features. For more complex, multi-modal data, an appropriate visualization should be embedded. As the focus of this work is on guiding users in their selection, this challenge is beyond the scope of this work and can be addressed in future work. By clicking on the corresponding instance in the list view, users can filter out non-interesting or unwanted instances from the selection. With this view, selected instances from the overview can be examined more closely and filtered if necessary.

Labeling interface
Users can assign class labels to the remaining selected instances in the Labeling Interface (top right in Fig. 1). For our prototype, used for the two experiments with two different datasets and class alphabets, a preset configuration containing the class alphabet was provided in each case. For future iterations of the tool, it should be possible to predefine a known class alphabet via an editable configuration file, which is automatically read during startup to populate the buttons of the labeling interface for the given dataset. Additionally, to make the tool robust toward explorative scenarios where the number and alphabet of classes are initially unknown, the labeling interface could easily be extended to add new classes during the labeling process. Thus, the label alphabet could grow together with the user's insights. A similar approach to iteratively adding new classes is employed, among others, in the labeling tools of Ritter et al. [50], Beil et al. [6] and Zahálka et al. [85].
In addition to the explicit assignment of labels, users can accept labels predicted by the model. This is especially useful in an advanced state of the model, where predictions are usually correct. Thus, at the end of the interactive labeling process, all remaining unlabeled instances can be automatically labeled by the model with one button click.

Support visualizations
On the left of the workspace (Fig. 1), additional information about the model's current state and the distribution of the labels is shown. For the prototype, the calculated accuracy of the model was displayed for orientation. This information is not available in a real application but can be replaced by information such as hyperparameters of the model. Also, data on the distribution of the labels are shown. Since the class alphabet used for the prototype was known, the user can be informed about classes that have not been labeled yet. This is an attempt to prevent classes from being overlooked. A bar plot shows the distribution of the assigned labels over the respective classes, encouraging users to distribute labels evenly to prevent bias in the model. For datasets with imbalanced classes, this plot can be extended to include predicted labels, allowing users to see whether the model reflects the actual known distribution of labels. If the distribution is unknown, this view can help to develop a mental model of the dataset. Given the diverse visualizations designed to convey information, users may, even without a firm strategy, gain valuable insights through serendipity [18] and carry them forward during the process. For the purposes of our prototype, we focused on simpler visualizations that are well known among the majority of users, since novel and more complex types of visualization require extensive and comprehensive onboarding [64] before users can use the tool.

Evaluation: user study
While research question RQ 1 was addressed by the design of the approach itself, RQ 2, how guiding users affects their selection of instances and labeling strategy as well as the resulting classifier, was evaluated with a user study. Comparable studies used probabilistic models to simulate the interactions with AL methods. However, these studies either focus on the reliability of the label assignment or the data-based selection. Since we aim to evaluate the effect of user guidance, we opted for a study with real users. Thus, a prototype implementing our approach was used by real subjects to assess the impact of guidance on users. The aim of the study is to evaluate how the proposed user guidance (utility and recommendations of instances) affects the users' selection and in turn the accuracy of the classifier.

Datasets
Two datasets were used during the study: MNIST [41] for the user study in this section and Fashion-MNIST [82] in the next section to evaluate a more complex setting. We opted to use the rather simple but publicly available MNIST dataset in our user study for a number of reasons: (1) related work that our work builds on uses MNIST, allowing the research community to relate our work to previously published research; (2) the dataset is widely known and easily understandable, allowing participants to grasp the problem setting and approach without specific data knowledge; (3) interactive labeling of the data does not require expert knowledge, making it easier to recruit participants for the user study.
In order to keep the difficulty of the task as low as possible, the MNIST dataset was used for the tasks to be performed by the subjects. The individual instances are classified representations of handwritten digits from 0 to 9, each digit being represented by a 28 × 28 pixel grayscale image that describes the brightness of each pixel with an integer value from 0 (black) to 255 (white). This produces white digits on a black background. The dataset contains a total of 70,000 instances divided into a training set of 60,000 and a test set of 10,000 instances.
To speed up classification, the original dimensionality of 28 × 28 = 784 was reduced to 42 dimensions using a feature extractor. The descriptor proposed by Bernard et al. [7] calculates the accumulated brightness readings of 11 horizontal, 11 vertical, and 20 diagonal sections based on the original image. Although this extractor does not correspond to the recommended standard extractors usually applied in image processing [40], it is very easy to understand and to reproduce.
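A hedged sketch of such a section-based descriptor is shown below; the exact section boundaries of the original extractor are an assumption, but the 11 + 11 + 20 = 42 structure follows the description above:

```python
import numpy as np

def brightness_descriptor(img):
    """Accumulated brightness over 11 horizontal, 11 vertical and
    20 diagonal sections of a 28x28 image -> 42 values in total.
    Section boundaries here are illustrative."""
    rows = [s.sum() for s in np.array_split(img.sum(axis=1), 11)]
    cols = [s.sum() for s in np.array_split(img.sum(axis=0), 11)]
    # Offsets -27..27 yield 55 diagonals, grouped into 20 sections.
    diags = np.array([np.trace(img, offset=k) for k in range(-27, 28)])
    diag_secs = [s.sum() for s in np.array_split(diags, 20)]
    return np.array(rows + cols + diag_secs)

descriptor = brightness_descriptor(np.ones((28, 28)))  # shape (42,)
```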
To familiarize the subjects with VisGIL, the Fashion-MNIST dataset was used. This should prevent subjects from memorizing characteristic features of the study data and thus avoid learning effects. The structure and size of Fashion-MNIST are designed to be exactly the same as those of the MNIST dataset, so that the two datasets can be exchanged without having to adapt the ML algorithm in use. Unlike the MNIST dataset, this dataset contains grayscale images of 10 different classes of clothing and accessories. These classes are not as easy to distinguish as the MNIST handwritten digits.

Participants
For the user study, 16 students (5 female) were recruited as participants. Each participant took part in the hands-on session and both parts of the experiment. All participants held at least a bachelor's degree, were in the second semester of a data science master's program, and had already attended lectures on ML, data mining, and visualization.

Procedure
The experiment was conducted remotely via a video conferencing tool. Each subject used his or her private workstation, with displays differing in size, resolution, and color settings. The classifier's training time and utility calculation were not affected by the technical specifications of the workstations used, as all calculations were performed on a common server. Our prototype was deployed on a cloud instance (Intel Xeon Platinum 8124M at 3.4 GHz, 16 CPU cores, 42 GB RAM).
Initially, all subjects were given a general introduction to the topic of visual interactive labeling as well as a detailed introduction to the presented approach, with the possibility to ask questions at any time. After the presentation and introduction of the tool, the subjects explored the tool extensively for approximately 15 min in a hands-on session with the Fashion-MNIST dataset. The subjects were shown how to train and evaluate the classifier through the interactions.
The study was subdivided into two parts: In the first part, the subjects had to solve the data labeling task under two different setups. In the second part, the subjects were asked to complete a questionnaire for self-assessment and evaluation of our approach. The overall time to perform the two parts was estimated at 30 min, depending on the extent of the questionnaire. Since the experiment only took a relatively small amount of time, no breaks were scheduled between the two tasks. A detailed description of the experimental designs for each part follows.

Task description
Following a within-subject design, each of the subjects had to complete two tasks, where the first one had to be completed with a lower degree of guidance and the second one with a higher degree of user guidance (see Table 1). The subjects were asked to select and label data points according to their own preferences. In addition to potential patterns in the visualization, subjects were able to assess additional information for their decision regardless of the independent variable.
A bar chart displayed the distribution of already labeled instances over the different classes. This should motivate the users to assign labeled instances to all classes and reduce imbalances in favor of a few classes. The subjects were given the task to optimize the accuracy of the classifier displayed in the interface as the task progressed. They were informed about the existing classes in the used dataset before starting the experiment.
Pre-studies conducted by us have shown that both the display of a time limit in the tool and the mere awareness of such a time limit are too distracting for the subjects. Subjects then focused mainly on labeling as many instances as possible before time expired rather than concentrating on a carefully reasoned selection. Hence, to avoid putting pressure on the subjects, they were not given a time limit for processing both setups. Nevertheless, we still kept the time in the background and stopped the tasks after 10 min.

Setup
The study was subdivided into two subsequent setups (see Table 1) using the same prototype of VisGIL (depicted in Fig. 1).
Setup S 1 corresponds to a user-driven selection of instances which is then supported by the AL model's calculation of the selected instances' utilities. We categorize this as guidance degree orienting, the lowest degree of guidance according to [14]. In addition, the current state of the ML model, i.e., its predictions for the data points, is shown by class colors. Using the shown class predictions and the provided utility measure, users should be able to match their mental model with the current state of the ML model and consequently deduce which instances are useful to label.
In setup S 2 , which presents the next higher degree of guidance, directing, users are offered possible options to achieve the goal. In contrast to S 1 , S 2 is extended by mapping an instance's utility (see Eq. 5) to the size of its visual representation. Furthermore, explicit recommendations of particularly useful instances are computed and visually highlighted to draw the user's attention to them.
Both setups thus contain information about the model, like predicted classes or the computed utility of each instance. The only difference between the two setups is that users in S 1 have to select the candidates to be labeled completely on their own with the help of the given information, whereas in S 2 they are also explicitly pointed to said candidates. Our intention is to study the influence of the two different guidance degrees. For our approach, S 2 differs from S 1 in that it encodes an instance's utility by the size of the icon and that it shows the recommendations as a star icon.
For each setup, a random subset of 1,000 instances, 100 per class, was selected from the training dataset. The random seeds used were the same for all subjects but differed between the setups to avoid possible memorization of striking areas. To evaluate the trained model, 450 instances were randomly taken from the test dataset. This partitioning is based on a 70/30 train/test split.
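Such a fixed-seed, class-balanced subset can be drawn as follows; the function name and use of NumPy are our illustration, not the original study code:

```python
import numpy as np

def stratified_subset(y, per_class=100, seed=0):
    """Draw `per_class` random instances of each class; with ten
    classes this yields the 1,000-instance setup subsets."""
    rng = np.random.default_rng(seed)
    picks = [rng.choice(np.where(y == c)[0], size=per_class, replace=False)
             for c in np.unique(y)]
    return np.concatenate(picks)

y_train = np.repeat(np.arange(10), 200)   # toy labels, 200 per class
subset = stratified_subset(y_train)       # 1,000 indices, 100 per class
```

Using a different seed per setup, as described above, yields distinct but reproducible subsets for all subjects.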

Qualitative user feedback
After completing the two tasks, the subjects were asked to complete a digital questionnaire (see Table 2). The questionnaire was designed to determine how users assess and perceive the influence of guidance on their selection. In addition, it was meant to evaluate which particular factors influence the users' selection of instances, to better understand user guidance for labeling data.

Results
All subjects successfully completed both tasks and the subsequent questionnaire. The aim of the user study was to evaluate how user guidance affects the accuracy of the trained model (effectiveness) and the number of labeled instances (efficiency), i.e., to evaluate and compare the performance of the model building process. As part of this, we also studied how user guidance affects the users' selection and, consequently, the distribution of assigned labels. Therefore, we recorded the sequence of labeling events during the experiment for each subject. A new event was created each time instances selected by a user were labeled and the model was re-trained and evaluated. Each event contains a timestamp, the model's test accuracy, and the number of labeled instances. After completion, all instances were also recorded with their user-assigned labels.
The resulting time-indexed sequences were used to compare the development of the model's accuracy over time (Fig. 4) and the development of the number of labeled instances over time (Fig. 5) between the two tasks with different degrees of guidance. The influence of the guidance on the models was tested statistically. The list of all instances with their assigned labels was used to create and analyze the distribution of assigned labels across all classes in the dataset (Fig. 6). The distribution of the answers' tendencies toward perceived confidence (Fig. 7a), difficulty (Fig. 7b), and the influence of guidance (Fig. 8) were analyzed and compared. To analyze further user impressions and opinions, we conducted an explorative analysis of the answers given.

Procedure
The users individually started and stopped their tasks within the designated maximum time span. This resulted in recordings of different lengths. For better comparability, we shifted all measured time series to a common origin and shortened them to 496 s, the length of the shortest measured series. In Fig. 4, the dotted horizontal line marks the theoretically maximum achievable performance of the model with the available data of 1,000 instances and the feature extractor used. To evaluate the guidance's influence on the results, we conducted hypothesis tests.

Model Accuracy
Fig. 4 Accuracy of the trained model (y-axis) on an unseen test set over the duration of the task (x-axis). Models with the higher degree of guidance, directing, show higher accuracy than models with the lower degree, orienting, both on average and in the IQR

The visualization of the model's accuracy over time (Fig. 4) shows that the accuracy of the trained model increases steadily over time, regardless of the user guidance. It can be seen that both approaches work similarly well at the beginning. From about 200 s on, the approach with guidance degree directing achieves higher accuracy in less time. With increasing time, the variance for the approach with guidance degree directing decreases more than for the approach with guidance degree orienting, as can be seen from the narrower colored area. At the end of the task, the accuracy values of the two approaches differ visibly.

Fig. 5 Number of labeled instances (y-axis) over the duration of the task (x-axis). Users with the higher degree of guidance, directing, have labeled more instances in the same time span

Amount of Labeled Instances
The visualization of the number of labeled instances over time (Fig. 5) shows that users with guidance degree directing have, on average, labeled more instances in the same time. Regardless of the guidance, the number of instances increases approximately linearly over time. Both curves show some abrupt increases, especially for users with guidance degree directing at the end. While the spread of values in the IQR remains stable for users with guidance degree directing beyond 150 s, except for a few outliers, it increases for users with guidance degree orienting. The time between the labeling interactions of the users, i.e., the time they took to analyze and actually label data points, is visible in both curves by the staircase shape and the curve running parallel to the x-axis at the very beginning.

Hypothesis test
Our experimental results (see Figs. 4, 5 and Table 3) show that the participants performed better in S 2 regarding both the measured model accuracy and the number of labeled instances. This suggests that our proposed implementation of the guidance degree directing is superior. In order to determine the statistical significance of these improvements, we performed a series of hypothesis tests. The following null hypotheses were tested:

H1 null: The accuracy of the ML model on the training set with guidance degree directing is not higher than with guidance degree orienting.
H2 null: The accuracy of the ML model on a blind test set with guidance degree directing is not higher than with guidance degree orienting.
H3 null: The number of labeled instances with guidance degree directing is not higher than with guidance degree orienting.
H4 null: The improvement in accuracy per interaction with guidance degree directing is not higher than with guidance degree orienting.
H5 null: The ratio of instances correctly labeled with guidance degree directing is not higher than with guidance degree orienting.
Paired tests were used since all subjects used both setups and therefore the results are dependent. Since we observed better results for S 2 for all considered aspects, we use a one-sided t-test, as opposed to a two-sided test. We tested for the differences in the two samples' (S 1 , S 2 ) mean values with a significance level of α = 5%. As null hypotheses we assumed no differences between the two samples (μ_d = 0). The alternative hypotheses were formulated such that guidance degree directing is superior to guidance degree orienting.
Table 3 shows that we can clearly reject the null hypotheses H1 null and H2 null, i.e., the accuracy of the resulting models with guidance degree directing is indeed significantly higher (p-values of 0.005 and 0.030, respectively). Calculation of the effect sizes according to Cohen [17] indicates medium effects of d = 0.51 and d = 0.73 of the degree's effect on the models' accuracy.
While the number of labeled instances is higher with guidance degree directing, we failed to reject H3_null. Regarding H4_null, the accuracy improvement per interaction is higher with guidance degree directing; however, with a p-value of 0.067 we failed to reject H4_null at α = 0.05 by a small margin. Finally, the number of correctly labeled instances is not statistically significantly higher, i.e., we failed to reject H5_null. In both setups, approximately 5% of the instances labeled by users are incorrect.
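The analysis described above (a paired, one-sided t-test together with Cohen's d for paired samples) can be sketched as follows. The accuracy values below are synthetic placeholders, not the study's data; the sample size of 16 matches the study's cleaned sample.

```python
# Sketch of the paired, one-sided t-test and Cohen's d used in the analysis.
# The accuracy values are synthetic placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_orienting = rng.normal(0.70, 0.05, size=16)             # S1: orienting
acc_directing = acc_orienting + rng.normal(0.03, 0.02, 16)  # S2: directing

# H0: mean difference is zero; H1: directing yields higher accuracy.
t, p = stats.ttest_rel(acc_directing, acc_orienting, alternative="greater")

# Cohen's d for paired samples: mean of the differences over their std. dev.
diff = acc_directing - acc_orienting
d = diff.mean() / diff.std(ddof=1)

print(f"t = {t:.3f}, p = {p:.4f}, Cohen's d = {d:.2f}")
```

The `alternative="greater"` argument (SciPy ≥ 1.6) expresses the one-sided alternative hypothesis directly, instead of halving a two-sided p-value.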

Label distribution
It can be seen in the visualization showing the distribution of labeled instances over the given classes (Fig. 6) that the labels are distributed unequally regardless of the users' guidance degree. In either case, the distribution of labeled instances over the classes is similar: the same classes are either over- or under-represented. Larger visual differences between the distributions of the users with different degrees of guidance are apparent in classes 4 and 6, where more instances were labeled on average in the task with guidance degree directing. However, it must be kept in mind that in the task with guidance degree directing, generally more instances were labeled on average than in the task with the lower degree. The distribution is consistent with the feedback of some users that instances of certain classes were difficult to find in both tasks. Besides, clusters of several classes were clearly distinguishable from the other instances in the Instance Overview & Selection (Fig. 1).

Fig. 6 Distributions of label counts (y-axis) over the classes in the dataset (x-axis) for guidance degree directing (red) and guidance degree orienting (blue). In both cases, the instances are not equally distributed but imbalanced, with similar tendencies toward the same classes.

Qualitative questionnaire
The answers to the questions in Table 2 were analyzed regarding perceived confidence (Fig. 7a), difficulty (Fig. 7b), and the influence of guidance (Fig. 8). Results of this analysis of users' impressions and opinions about the tool and its two setups are presented next.

Perceived Confidence: Users' self-assessment of the perceived confidence in identifying interesting instances (Fig. 7a) reveals that the majority of subjects felt confident in finding interesting instances regardless of guidance. None of the subjects stated that they were insecure in their self-assessment of S2, which included guidance degree directing. 16 of 19 subjects felt confident in the more guided selection.

Fig. 7 Perceived confidence of users (a) when selecting instances with guidance degrees orienting (Q1) and directing (Q3); no subject stated being insecure in S2. Perceived difficulty of users (b) in identifying instances with guidance degrees orienting (Q2) and directing (Q4); the number of subjects having difficulties identifying instances for labeling decreased by more than half compared to S1 with the lower degree of guidance. [28]
Perceived Difficulty: Users' perception of difficulty shows that the majority of subjects found it difficult to identify interesting instances for labeling with the lower degree of guidance (Fig. 7b). For S2, including guidance degree directing, three quarters of the subjects reported that it was easy to identify instances; only 3 subjects found it "very hard." Compared to S1 with guidance degree orienting, the number of subjects who found it easy to identify interesting instances more than doubled in S2.

Perceived Influence: The majority of subjects stated that they were strongly influenced by the higher degree of guidance and that this influence was helpful. Half of the subjects said that the highlights and recommendations encouraged them in their choice. Only two and three of the 19 subjects, respectively, reported that this guidance had hindered or confused them. Part of them stated that they did not make use of the offered guidance and paid attention neither to the icon size nor to the recommendations. One subject claimed that the recommendations introduced a bias toward certain instances and classes, thus compromising his/her intuition, suggesting that guidance by icon size only would be sufficient.

Fig. 8 Perceived strength (a), type (b), and perceived usefulness (c) of the guidance's influence. [28]

Feedback on the Tool
The subjects provided plain-text feedback on the two setups of the tool. Two subjects liked that, in S1 with the lower degree of user guidance, they were not influenced or distracted by recommendations; they did not have to consider the stars and could focus more on their intuition. A few subjects stated that they recognized the distribution and clustering of instances well via the density of data points. However, the majority of subjects reported for S1 that they often had no orientation as to which instances should be selected and labeled next. In general, it was difficult to find interesting instances due to the unclear presentation of the data points.
For S2 of the tool, including recommendations and varying icon sizes, subjects said that it gave them clues as to where it would make sense to label; more information was available to find instances. Subjects felt that the model's accuracy increased more quickly and that they had more confidence in their selection. On the other hand, some subjects criticized S2 for the fact that they felt influenced by the stars in their choice. Some subjects even felt compelled by the stars to label the highlighted instances. Users admitted that they concentrated only on the stars or the area immediately around them and developed a kind of "tunnel vision." A small number of subjects were overwhelmed by the additional information in this setup.
13 of 19 subjects preferred S2 with guidance degree directing. They stated that due to the stronger guidance, S2 is clearer, faster, and more effective to work with. They liked that the tool allows for a more confident selection of instances, as it reinforces their decisions. While it might seem obvious that the more strongly guided S2 is better for navigating users toward interesting instances, we argue that any general navigation solution might also interfere with the users' intuition and decision-making. Our study showed that some, yet only very few, subjects preferred the less guided S1, stating it was clearer and easier to use.
In general, users enjoyed the clarity and easy handling of both setups. Apparently, the tool made a rather exhaustive and repetitive task enjoyable. For future revisions, users asked for more convenience through features like hotkeys, the ability to filter labeled data, and the option to toggle the presence and degree of guidance. Interestingly, the subjects expected to be able to train a model with 100 percent accuracy. A few subjects got frustrated that, after a certain point, the accuracy did not seem to increase. This user-perceived limitation is caused by the feature extraction descriptor used in the study and the resulting loss of information: the highest achievable accuracy of the classifier was 88.11%, obtained by using all 1000 correctly labeled instances of the subset used for the experiment to train the classifier.

Evaluation: think-aloud test
To evaluate our approach on a different, more challenging task, we conducted an additional user study accompanied by a think-aloud test. For this, we used the more complex Fashion-MNIST dataset, where classes are harder to distinguish. The participants were given the task of labeling data such that the model achieves an accuracy of ≥ 50%. This task was conducted as a think-aloud user study [71] without a time limit and without recording quantitative results. The intention was to learn from the ad hoc feedback about the users' decision-making process and whether the visual cues we proposed for user guidance played a role in it.

Procedure
First, each subject received a brief introduction to the prototype, including a hands-on session as well as an explanation of the Fashion-MNIST dataset along with the task description. While completing the task, the participants were encouraged to speak their thoughts aloud. Along with an audio recording, we took notes about interesting observations and notable points for each subject during the completion of the task. Subsequently, the subjects were asked four follow-up questions about the visual cues and the tool in general.

Results
From the think-aloud test and the interview, we were able to derive the following three common procedures and strategies: (1) Early in the task, we observed that subjects focused on the model's recommendations in the form of the data points highlighted as stars. In doing so, subjects often chose stars located at the outer edges of the t-SNE projection. Subjects mainly selected a moderate number of 10-15 instances arranged in a circle around the star using the lasso; only one of the subjects selected a much larger set of instances. A further insight was that (2) the strategy of selecting stars and the instances surrounding them was largely maintained by the subjects throughout the entire session. Two of the subjects decided after a while to select individual stars only, but quickly abandoned this approach. In addition to the stars, (3) the subjects reported increasingly focusing on areas with large data points or dense areas over the course of the task. One of the subjects tried to select a point cloud containing only small data points at a later stage to locate a missing class. However, since this selection did not lead to the expected result, it was immediately discarded and large data points and stars were preferred again. At an intermediate stage of the task, when four to five of the ten classes had already been identified and labeled, one of the subjects focused more on the colors of the classes in order to identify missing classes or to correct prediction errors of the model.
In general, we observed for all subjects that the search and selection of instances were highly exploratory at the beginning. Similar approaches and strategies could be observed among the subjects as to which areas were considered potentially interesting. Primarily, the model recommendations in the form of large icons and stars were used, as well as the clusters generated by t-SNE in the 2D visualization. With about half of the ten classes identified and labeled, the exploratory approach shifted to a goal-oriented search. Thus, two of the subjects focused on instances mislabeled by the model. When they realized that a misclassification was due to a missing class, they immediately tried to find and label it in the visualization so that this class was made known to the model. All of the subjects eventually reached a point where about eight to nine of the classes had been identified through the iterative selection and labeling of instances. The challenge at this point was to find the remaining missing class(es). During this search, the spatial distribution of the instances in the plot as well as the shape of the color-coded classes played a major role. In addition, labeling was used to try to distinguish similar classes more clearly.
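The 2D overview the subjects navigated is based on a t-SNE projection. A minimal sketch of such a projection using scikit-learn, with synthetic feature vectors standing in for the Fashion-MNIST descriptors (cluster means and dimensions are our own illustrative choices):

```python
# Minimal t-SNE projection sketch, as used for the Instance Overview.
# Synthetic feature vectors stand in for the Fashion-MNIST descriptors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Three synthetic "classes", 50 instances each, in a 64-dimensional space.
features = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 64))
                      for c in range(3)])

# Project to 2D; each instance becomes one point in the overview.
xy = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=42).fit_transform(features)
print(xy.shape)  # → (150, 2)
```

In the tool, these 2D coordinates would be rendered as icons whose color, size, and shape encode class, utility, and recommendation status, respectively.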
During the filtering of the selected instances, conservative and careful behavior was observed among the subjects: instances were filtered very carefully based on the preview images to avoid labeling any instance incorrectly. When filtering, the subjects focused primarily on the thumbnail representation of the instances; only one of the subjects stated that he had used the utility bar to ensure that influential instances remained in the selection and were labeled in any case. For one subject, we observed that, after assigning the label, he/she realized that one of the selected instances did not actually match the assigned class. However, correcting the labeling decision was not possible, since he/she did not remember the previously selected area. As a result, the error was knowingly ignored and accepted.
Supporting visualizations in the form of the accuracy progress bar, the indication of whether a class had already been found, and the distribution of the labeled instances over the classes were used by the subjects once about half of the instances were already labeled. Subjects reported being made aware of missing or underrepresented classes by the histogram and the green/red indicator. Furthermore, subjects perceived the accuracy progress bar as a great motivator and an indicator of how well they had already accomplished the task. According to all of the subjects, working on the task with the help of the presented tool was a lot of fun and motivating. One reason mentioned was the easily understandable and consistent visualizations, like the histogram and the consistent use of colors in the overview, as well as the lasso selection tool. Also, the meaning of the icons in the overview was easy to understand; one subject mentioned that the provided legend is superfluous and might waste space.
For future revisions of the prototype, users would like to see labeling buttons that match the colors of the classes in the overview and provide additional information, such as the histogram or the green/red indicator. Further suggested improvements were mainly related to the poor resolution of the preview images of the Fashion-MNIST dataset, which is independent of the tool.

Discussion
The findings of our quantitative user study support the assumption that model performance can be improved by guiding users in the selection of instances [7]. This is supported by indicators such as superior classification accuracy (Fig. 4, Table 3, H1_null & H2_null) and an increased number of labeled instances (Fig. 5) during the task. This was the expected result, as users had additional information provided by the model when selecting interesting instances.
The indicators of the two setups are closer to each other than one might intuitively assume, given the expectation that guided users should perform significantly better than an unguided control group. However, the two setups do not compare guided versus unguided users, but a higher degree of guidance with a lower one. Thus, the slightly improved performance of users in S2 reflects the effect of the explicit candidate recommendations, as all other variables remained constant.
In terms of user selection performance, our results are similar to those of Bernard et al. [7] and Chegini et al. [16], although our accuracy curve is slightly flatter and thus more stretched. Such variations, however, can be explained by variables such as different participants, tools, instances, and projections.
The colored IQR areas in the visualization of our results (Figs. 4, 5) suggest that user performance varies considerably between subjects and is, furthermore, prone to errors regardless of the degree of guidance. This error becomes evident in the two mean percentages of correctly labeled instances for H5_null (Table 3), which indicate approximately 5% mislabeled instances. Although this error seems marginal, it can have a significant impact on AL and VIL scenarios, since their basic concept is to require users to label only a few, but highly influential, instances for the training process. Thus, even a small number of incorrectly labeled instances may have a substantial impact on the model's accuracy.
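The effect of such labeling errors can be illustrated with a small synthetic experiment (not a reproduction of the study): flip ~5% of the training labels and compare test accuracy against training on clean labels. The dataset, classifier, and noise model are our own illustrative choices.

```python
# Illustrative experiment: impact of ~5% label noise on a classifier,
# using synthetic data rather than the study's MNIST subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Flip 5% of the training labels to simulate user labeling errors.
rng = np.random.default_rng(0)
y_noisy = y_tr.copy()
flip = rng.choice(len(y_tr), size=int(0.05 * len(y_tr)), replace=False)
y_noisy[flip] = (y_noisy[flip] + 1) % 4

clean = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
noisy = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
print(f"clean labels: {clean:.3f}, 5% noisy labels: {noisy:.3f}")
```

With the small labeled sets typical of AL, the relative damage of the same noise rate tends to be larger, since each label carries more weight.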
The MNIST and Fashion-MNIST datasets contain instances that can be misinterpreted, for example, because of unclear handwriting or semantic overlaps. Furthermore, the lasso tool of the prototype allows multiple instances to be selected at once, which might lead to critical instances being overlooked and mislabeled.
Users make errors, get exhausted, and lose attention over time during repetitive tasks such as labeling instances. Therefore, when evaluating techniques that are prone to user error, we argue that one should not compare user performance against an error-free algorithm. This imperfect behavior should also be taken into account when using statistical models to simulate users and their selection and labeling strategies [55].
Sacha et al. [52] suggested that automatic recommendation of data points and possible actions would be an effective way to make domain experts aware of interesting items. Both the results of our qualitative questionnaire (Sect. 5.5.5) and the observations from the think-aloud test (Sect. 6.2) support this idea, as users claim to feel more oriented and confident in their actions as a result of the recommendations. Furthermore, we consider our results a confirmation of the three degrees of guidance according to Ceneda et al. [14]. While users in setup S1 were able to determine that something was wrong with the ML model, they were often not able to derive the steps required to improve it. This was only possible with S2, which recommended alternative solutions along the path toward a good model. Moreover, we observed that VisGIL supports the statement of Amershi et al. [3] that users do not want to merely take the role of an oracle giving simple labels, but rather prefer to be involved in the interaction and to understand the model better.
In our approach, we assumed that visual cues, such as the size, shape, and color of records, are a way to draw the attention of users to specific areas and instances. The think-aloud study confirmed this assumption, as subjects mentioned that the highlighted stars and the various sizes and colors of the icons were influential in selecting instances.
Aside from the visual cues as the main impacting factor, we were able to observe and confirm other user strategies during the think-aloud test, such as Dense Areas First, Class Borders Refinement, and Class Intersection Minimization, which were stated by Bernard et al. [9]. Considering the statements of a minority of subjects about the better perceived density and distribution of instances in the clusters in S1, future approaches could focus on developing a visualization that directs the users' attention to specific data without affecting its representation.
Surprising feedback from two subjects revealed a negative impact of guidance on user selection: the recommended instances not only draw users' attention to certain instances, but can also distract users during their individual selection. These subjects felt that recommended instances had to be labeled in any case, causing them to stop thinking for themselves and to adopt the recommendations unchecked. The recommendations were mistakenly perceived by them as a prescribed solution path resembling the third degree of guidance [14]. Thus, the visualization of the recommended instances might be too strong. It is necessary to choose a degree of guidance with a visualization that leads users toward a goal without rendering them blind to alternative solutions.
A final interesting insight and promising research direction becomes apparent considering the statistically significant difference in the accuracy on the training set and test set (Table 3, H1_null & H2_null) that we observed during our experiments in Sect. 5: We believe that a potential risk in interactive labeling is what we referred to in an earlier publication as "manual overfitting" [66]. Users might subconsciously label toward easier class separability for the ML model, rather than according to their knowledge about the data. We propose to consider the dilemma of potential underfitting and overfitting during the interaction between users and ML models analogously to the training of ML models. A research direction could be the formulation of cost functions with regularization terms that aim to regulate the user interactions. Interestingly, in our experiments, the users seemed to rather underfit the model, a state that would be helpful to explore using the aforementioned formulation.
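Purely as an illustration of what such a regularized cost function could look like (all symbols are our own choices, not taken from this or the cited work): with f_θ the model, D_u the user-labeled set, L the model loss, Ω a penalty on selection behavior that drifts toward easily separable instances, and λ its weight, one could write

```latex
\min_{\theta}\;
\underbrace{\mathcal{L}\!\left(f_{\theta}, \mathcal{D}_{u}\right)}_{\text{model loss on user-labeled data}}
\;+\;
\lambda\,
\underbrace{\Omega\!\left(\mathcal{D}_{u}\right)}_{\text{regularizer on the users' selections}}
```

where the concrete form of Ω, and whether it acts on the selection, the labels, or both, remains an open question of the proposed research direction.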

Conclusion
We presented VisGIL, an approach to interactively label instances, providing a combination of user-based selection and model-based suggestions to identify candidates for labeling (Sect. 4). Based on an AL query strategy, the estimated benefit for all unlabeled instances as well as recommendations of particularly useful instances are calculated and then presented to the user via visual cues in an interactive visualization. The approach was evaluated with a user study (Sect. 5) and a think-aloud test (Sect. 6). From the studies, we conclude that (1) user guidance has a positive effect on the accuracy of the resulting classifier, and (2) the number of labeled instances is higher, although this difference could not be shown to be statistically significant. Further, (3) the use of visual cues in an interactive visualization is a promising strategy to guide users; however, (4) the influence of the individual cues on an accurate representation of the dataset must be considered, otherwise distortions may occur.
Potential limitations of our work arise from the use of the MNIST dataset, where the classes are relatively easy to distinguish. Future work could evaluate the scalability w.r.t. the number of classes. While the approach is likely to scale to some extent, a very high number of classes will overwhelm users. We believe that a high number of classes is not in the scope of interactive labeling on a per-class basis. A potential solution could be hierarchical class structures, as given, for example, in the CIFAR-100 or ImageNet datasets.
The use of more complex data or different data types requires an adjustment of the preprocessing pipeline. A promising approach for the projection of complex data was presented, for example, in [2], where multivariate time series are projected onto a two-dimensional space.
Furthermore, our experiments were conducted with 1000 data points. For massive datasets, a limiting factor regarding the scalability of the approach is the need to iteratively retrain the ML model. A possible solution would be to continue training a previously trained ML model using newly available labeled instances, rather than training a model from scratch on the entire data ("cold start"), similar to the shrink-and-perturb technique for neural networks proposed by Ash et al. [4]. For large datasets or complex models, specialized architectures [36,65] and software libraries [48,77,78] leveraging GPUs may be utilized in the implementation to ensure minimal latency and real-time feedback, keeping the tool interactive for its users. Beyond the use of hardware acceleration and more computing power, distributed systems provide another lever to scale the approach to more data points and more complex models by processing data and models on multiple nodes in parallel. Generalist approaches such as MapReduce [27] or ML-focused approaches such as TensorFlow [1] can be utilized. Fully managed software-as-a-service solutions from large cloud providers such as AWS, Google, or Azure support or use these distributed approaches themselves to further simplify the scaling of ML solutions for developers [73]. By utilizing speculative computations of parameter updates, the performance of these distributed systems can be increased even further [86].
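The idea of updating a model with new labels instead of cold-start retraining can be sketched with scikit-learn's incremental-learning API. The data and model choice below are our own illustrative assumptions, not the study's setup:

```python
# Sketch of avoiding cold-start retraining: continue training an existing
# model on newly labeled instances via partial_fit, instead of refitting
# on all data from scratch each round.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1, 2])

model = SGDClassifier(random_state=0)

# Initial batch of labeled instances (synthetic stand-ins).
X0 = rng.normal(size=(100, 10))
y0 = rng.integers(0, 3, 100)
model.partial_fit(X0, y0, classes=classes)  # classes must be given up front

# Later: the user labels a few more instances; update instead of retraining.
X1 = rng.normal(size=(15, 10))
y1 = rng.integers(0, 3, 15)
model.partial_fit(X1, y1)

print(model.predict(X1[:3]))
```

Each update costs time proportional to the new batch only, which is what keeps the labeling loop interactive as the labeled set grows.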
A further limiting factor is the increased chance of overplotting for a high number of data points. This could be addressed using strategies for a reasonable selection of instances such that only a manageable subset is used and the labels are transferred to the entire dataset via the model. One concrete solution among others is the application of a kernel density estimator followed by a downsampling technique that reduces the complexity of the plot by removing instances from high-density regions [53].
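The density-based downsampling idea can be sketched as follows; the bandwidth, quantile threshold, and keep probability are illustrative assumptions, not values from [53]:

```python
# Sketch of density-based downsampling to reduce overplotting: estimate a
# kernel density over the 2D projection, then thin out high-density regions.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
xy = rng.normal(size=(1000, 2))  # stand-in for projected instances

log_density = KernelDensity(bandwidth=0.3).fit(xy).score_samples(xy)

# Keep all points in sparse regions; keep only a random 30% subset of the
# densest quartile, where individual points overlap anyway.
dense = log_density > np.quantile(log_density, 0.75)
keep = ~dense | (rng.random(len(xy)) < 0.3)
thinned = xy[keep]
print(len(thinned), "of", len(xy), "points kept")
```

Sparse regions, where every instance may be visually important, remain untouched, while only visually redundant points in dense clusters are dropped.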
Another potential limitation is that we assumed a priori knowledge of the number of classes. This knowledge is not always available in real applications. How the distribution and awareness of the classes affect the performance was not evaluated and remains potential future work.
Future research could investigate how the degree and form of guidance can be adapted to users' individual needs, so that they can determine the degree of guidance themselves to receive either recommendations or precise instructions. Besides, it would be useful to know whether a model or algorithm can automatically determine what kind of guidance a user needs. Innovative ways of guiding users in their work that go beyond highlighting data points are also a field for further research. We further see untapped potential in more in-depth research on hybrid approaches like ours that combine user- and model-based instance selection. Through optimal collaboration among analysts and models, the gap between ML and interactive visualizations could be narrowed and perhaps even completely closed in the future.
Funding Open Access funding enabled and organized by Projekt DEAL.

Declarations
Conflict of interest All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 1 Workspace of VisGIL with the areas Instance Overview & Selection (center), Detail View & Filtering (bottom right), Labeling Interface (top right), and Support Visualizations (left).The selected instances

Fig. 2 Schematic illustration of steps to guide users to useful instances for labeling through visual cues in the Labeling Interface, combining the partial results of components Candidate Suggestion and Result Visualization from Fig. 3

Table 1 Comparison of the features included in the two setups S1 and S2 used for the study. S2 provides users with additional visual cues for instance selection:
- Listing and visual representation of the calculated utility per instance in Detail View & Filtering
- Mapping of the instance's class to the color of its visual representation in Instance Overview & Selection (color coding)
- Mapping of the instance's utility to the size of its visual representation in Instance Overview & Selection (size coding)
- Calculation and representation of recommended instances as highlighted stars in Instance Overview & Selection (shape coding)

Table 2 Questions for user self-assessment regarding the identification of instances for labeling and the influence of guidance on their selection (5-point Likert scales). [28]
Q1: How confident were you in identifying interesting instances WITHOUT recommendation stars and different sized icons? (Insecure / Neutral / Confident)
Q2: How difficult did you find the identification of interesting instances WITHOUT support of the stars and different sized icons? (Hard / Neutral / Easy)
Q3: How confident were you in identifying interesting instances INCLUDING recommendation stars and different sized icons? (Insecure / Neutral / Confident)
Q4: How difficult did you find the identification of interesting instances INCLUDING support of the stars and different sized icons? (Hard / Neutral / Easy)
Q5: How much did the recommendation stars and the icons INCLUDING different sizes influence your decision which instances are interesting? (Weak / Neutral / Strong)
Q6: In which way did the recommendation stars and the icons INCLUDING different sizes influence your decision which instances are interesting? (Confusing / Neutral / Strengthening)
Q7: How did you experience the recommendation stars and icons INCLUDING different sizes when identifying interesting instances compared to identification WITHOUT additional support? (Hindering / Neutral / Helpful)

surement. We decided to synchronize all recordings instead of discarding longer and shorter ones. This conservative data cleansing kept the sample size more representative at n = 16. The results of the guidance degrees orienting and directing are shown as line diagrams over time. Variations in the performance created by the individual subjects are shown by bundles (see Figs. 4, 5). The transparently colored area around the mean line depicts the interquartile range (IQR) of the measured results. The dashed lines indicate the minimum and maximum performance. The results with guidance degree directing (S2) are shown in red and orienting (S1) in blue.

Table 3 Comparison of mean values between the results with guidance degrees orienting and directing for all stated hypotheses, with corresponding p-values and rejection decisions for the null hypotheses