A Survey of Visual Analytics Techniques for Machine Learning

Visual analytics for machine learning has recently evolved as one of the most exciting areas in the field of visualization. To better identify which research topics are promising and to learn how to apply relevant techniques in visual analytics, we systematically review 259 papers published in the last ten years together with representative works before 2010. We build a taxonomy, which includes three first-level categories: techniques before model building, techniques during model building, and techniques after model building. Each category is further characterized by representative analysis tasks, and each task is exemplified by a set of recent influential works. We also discuss and highlight research challenges and promising future research opportunities useful for visual analytics researchers.


Introduction
The recent success of artificial intelligence applications depends on the performance and capabilities of machine learning models [163]. In the past ten years, a variety of visual analytics methods have been proposed to make machine learning more explainable, trustworthy, and reliable. These research efforts fully combine the advantages of interactive visualization and machine learning techniques to facilitate the analysis and understanding of the major components in the learning process, with an aim to improve performance. For example, visual analytics research for explaining the inner workings of deep convolutional neural networks has increased the transparency of deep learning models and has received ongoing, and increasing, attention recently [54,103,163,286].
The rapid development of visual analytics techniques for machine learning yields an emerging need for a comprehensive review of this area to support the understanding of how visualization techniques are designed and applied to machine learning pipelines. There have been several initial efforts to summarize the advances in this field from different viewpoints. For example, Liu et al. [162] summarized visualization techniques for text analysis. Lu et al. [173] surveyed visual analytics techniques for predictive models. Recently, Liu et al. [163] presented a paper on the analysis of machine learning models from the visual analytics viewpoint. Sacha et al. [218] analyzed a set of example systems and proposed an ontology for visual analytics assisted machine learning. However, existing surveys either focus on a specific area of machine learning (e.g. text mining [162], predictive models [173], model understanding [163]), or aim to sketch an ontology [218] based on a set of example techniques only.
In this paper, we aim to provide a comprehensive survey of visual analytics techniques for machine learning, which focuses on every phase of the machine learning pipeline. We focus on works in the visualization community. Nevertheless, the AI community has also made solid contributions to the study of visually explaining feature detectors in deep learning models. For example, Selvaraju et al. [222] tried to identify the part of an image to which its classification result is sensitive, by computing class activation maps. Readers can refer to the surveys of Zhang et al. [227] and Hohman et al. [103] for more details. We have collected 259 papers from related top-tier venues in the past ten years through a systematic procedure. Based on the machine learning pipeline, we divide this literature into three stages: before, during, and after model building. We analyze the functions of visual analytics techniques in the three stages and abstract typical tasks, including improving data quality and feature quality before model building, model understanding, diagnosis, and steering during model building, and data understanding after model building. Each task is illustrated by a set of carefully selected examples. We highlight six prominent research directions and open problems in the field of visual analytics for machine learning. We hope that this survey promotes discussion of machine learning related visual analytics techniques and acts as a starting point for practitioners and researchers wishing to develop visual analytics tools for machine learning.

Paper Selection
In this paper, we focus on visual analytics techniques that help to develop explainable, trustworthy, and reliable machine learning applications.
To comprehensively survey visual analytics techniques for machine learning, we performed an exhaustive manual review of relevant top-tier venues in the past ten years (2010-2020): these were InfoVis, VAST, Vis (later SciVis), EuroVis, PacificVis, IEEE TVCG, CGF, and CG&A. The manual review was conducted by three Ph.D. candidates with more than two years of research experience in visual analytics. We followed the manual review process used in a text visualization survey [162]. Specifically, we first considered the titles of papers from these venues to identify candidate papers. Next, we reviewed the abstracts of the candidate papers to further determine whether they concerned visual analytics techniques for machine learning. If the title and abstract did not provide clear information, we read the full text to make a final decision. In addition to the exhaustive manual review of the above venues, we also searched for representative related works that appeared earlier or in other venues, such as the Profiler [123].
After this process, 259 papers were selected. Tab. 1 presents detailed statistics. Due to the increase in machine learning techniques over the past ten years, this field has been attracting ever more research attention.

Taxonomy
In this section, we comprehensively analyze the collected visual analytics works to systematically understand the major research trends. These works are categorized based on a typical machine learning pipeline [183] used to solve real-world problems. As shown in Fig. 1, such a pipeline contains three stages: (1) data pre-processing before model building, (2) machine learning model building, and (3) deployment after the model is built. Accordingly, visual analytics techniques for machine learning can be mapped into these three stages: techniques before model building, techniques during model building, and techniques after model building.

Techniques before Model Building
The major goal of visual analytics techniques before model building is to help model developers better prepare the data for model building. The quality of the data is mainly determined by the data itself and the features used. Accordingly, there are two research directions, visual analytics for data quality improvement and feature engineering.
Data quality can be improved in various ways, such as completing missing data attributes and correcting wrong data labels. Previously, these tasks were mainly conducted manually or by automatic methods, such as learning-from-crowds algorithms [108] which aim to estimate ground-truth labels from noisy crowd-sourced labels. To reduce experts' efforts or further improve the results of automatic methods, some works employ visual analytics techniques to interactively improve the data quality. Tab. 1 shows that in recent years, this topic has gained increasing research attention.
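As a point of reference for the automatic methods mentioned above, the simplest baseline for estimating ground-truth labels from noisy crowd-sourced labels is a per-item majority vote. The sketch below is only that baseline, not any specific learning-from-crowds algorithm from [108]; the function name, the data layout, and the example items are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(annotations):
    """Estimate one label per item from several noisy crowd labels.

    `annotations` maps an item id to the list of labels given by workers
    (hypothetical layout). Returns item id -> (label, agreement), where
    agreement is the fraction of workers who chose the winning label.
    """
    estimated = {}
    for item, labels in annotations.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        estimated[item] = (top_label, top_count / len(labels))
    return estimated


# Toy crowd labels: three items, two or three workers each.
crowd_labels = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
    "img_3": ["cat", "dog"],
}
result = majority_vote(crowd_labels)
```

Items with low agreement, such as `img_3` here, are natural candidates for the interactive validation that the visual analytics methods in this category support.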
Feature engineering is used to select the best features to train the model. For example, in computer vision, we could use HOG (Histogram of Oriented Gradient) features instead of using raw image pixels. In visual analytics, interactive feature selection provides an interactive and iterative feature selection process. In recent years, in the deep learning era, feature selection and construction are mostly conducted via neural networks. Echoing this trend, research attention in this direction has declined in recent years (2016-2020) (see Tab. 1).

Techniques during Model Building
Model building is a central stage in building a successful machine learning application. Developing visual analytics methods to facilitate model building is also a growing research direction in visualization (see Tab. 1). In this survey, we categorize current methods by their analysis goal: model understanding, diagnosis, and steering. Model understanding methods aim to visually explain the working mechanisms of a model, such as how changes in parameters influence the model and why the model gives a certain output for a specific input. Model diagnosis methods target diagnosing errors in model training via interactive exploration of the training process. Model steering methods are mainly aimed at interactively improving model performance. For example, to refine a topic model, Utopian [53] enables users to interactively merge or split topics, and automatically modify other topics accordingly.

Techniques after Model Building
After a machine learning model has been built and deployed, it is crucial to help users (e.g. domain experts) understand the model output in an intuitive way, to promote trust in the model output. To this end, there are many visual analytics methods to explore model output, for a variety of applications. Unlike methods for model understanding during model building, these methods usually target model users rather than model developers. Thus, the internal workings of a model are not illustrated, but the focus is on the intuitive presentation and exploration of model output. As these methods are often data-driven or application-driven, in this survey, we categorize these methods by the type of data being analyzed, particularly as static data or temporal data.

Techniques before Model Building
Two major tasks required before building a model are data processing and feature engineering. They are critical, as practical experience indicates that low-quality data and features degrade the performance of machine learning models [197,243]. Data quality issues include missing values, outliers, and noise in instances and their labels. Feature quality issues include irrelevant features, redundancy between features, etc. While manually addressing these issues is time-consuming, automatic methods may suffer from poor performance. Thus, various visual analytics techniques have been developed to reduce experts' efforts as well as to simultaneously improve the performance of automatic methods of producing high-quality data and features [156].

Improving Data Quality
Data includes instances and their labels [199]. From this perspective, existing efforts for improving data quality either concern instance-level improvement, or label-level improvement.

Instance-level Improvement
At the instance level, many visual analytics methods focus on detecting and correcting anomalies in data.
Tab. 1 Categories of visual analytics techniques for machine learning and representative works in each category; number of papers given in brackets.
Thus, Bors et al. [25] proposed DQProv Explorer to support the analysis of data processing provenance, using a provenance graph to support the navigation of data states and a quality flow to present changes in data quality over time. Recently, another type of data anomaly, out-of-distribution (OoD) samples, has received extensive attention [139,142]. OoD samples are test samples that are not well covered by training data, which is a major source of model performance degradation. To tackle this issue, Chen et al. [45] proposed OoDAnalyzer to detect and analyze OoD samples. An ensemble OoD detection method, combining both high- and low-level features, was proposed to improve detection accuracy. Based on the detection result, a grid visualization (see Fig. 2) is utilized to explore OoD samples in context and explain the underlying reasons for their presence. In order to generate grid layouts at interactive rates during the exploration, a kNN-based grid layout algorithm motivated by Hall's theorem was developed.
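The notion of an OoD sample described above, a test sample that is not well covered by the training data, can be illustrated with a simple distance-based score. The sketch below is a generic stand-in and is not the ensemble detection method of OoDAnalyzer [45]; the function name, the choice of Euclidean distance, and the toy points are assumptions.

```python
import math


def knn_ood_score(sample, training_set, k=3):
    """Mean Euclidean distance from `sample` to its k nearest training points.

    A larger score means the sample lies further from the training data
    and is therefore a stronger OoD candidate.
    """
    distances = sorted(math.dist(sample, t) for t in training_set)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy 2D training set: the four corners of the unit square.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

in_dist = knn_ood_score((0.5, 0.5), train)   # centre of the square
out_dist = knn_ood_score((5.0, 5.0), train)  # far away from all training points
```

In practice the score would be computed in a learned feature space rather than on raw coordinates, and a threshold on it would have to be chosen for the dataset at hand.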
When considering time-series data, several challenges arise as time has distinct characteristics that induce specific quality issues that require analysis in a temporal context. To tackle this issue, Arbesser et al. [11] proposed a visual analytics system, Visplause, to visually assess time-series data quality. Anomaly detection results, e.g. frequencies of anomalies and their temporal distributions, are shown in a tabular layout. In order to address the scalability problem, data are aggregated in a hierarchy based on meta-information, which enables analysis of a group of anomalies (e.g. abnormal time series of the same type) simultaneously. Besides automatically detected anomalies, KYE [91] also supports the identification of additional anomalies overlooked by automatic methods. Time-series data are presented in a heatmap view, where abnormal patterns (e.g. regions with unusually high values) indicate potential anomalies. Click stream data are a widely studied kind of time-series data in the field of visual analytics. To better analyze and refine click stream data, Segmentifier [61] was proposed to provide an iterative exploration process for segmentation and analysis. Users can explore segments in three coordinated views at different granularities and refine them by filtering, partitioning, and transformation. Every refinement step results in new segments, which can be further analyzed and refined.
To tackle uncertainties in data quality improvement, Bernard et al. [16] developed a visual analytics tool to exhibit the changes in the data and uncertainties caused by different preprocessing methods. This tool enables experts to become aware of the effects of these methods and to choose suitable ones, to reduce task-irrelevant parts while preserving task-relevant parts of the data.
As data have the risk of exposing sensitive information, several recent studies have focused on preserving data privacy during the data quality improvement process. For tabular data, Wang et al. [259] developed a Privacy Exposure Risk Tree to display privacy exposure risks in the data and a Utility Preservation Degree Matrix to exhibit how the utility changes as privacy-preserving operations are applied. To preserve privacy in network datasets, Wang et al. [257] presented a visual analytics system, GraphProtector. To preserve important structures of networks, node priorities are first specified based on their importance. Important nodes are assigned low priorities, reducing the possibility of modifying these nodes. Based on node priorities and utility metrics, users can apply and compare a set of privacy-preserving operations and choose the most suitable one according to their knowledge and experience.

Label-level Improvement
According to whether the data have noisy labels, existing works can be classified as methods either for improving the quality of noisy labels or allowing interactive labeling.
Crowdsourcing provides a cost-effective way to collect labels. However, annotations provided by crowd workers are usually noisy [152,243]. Many methods have been proposed to remove noise in labels. Willett et al. [268] developed a crowd-assisted clustering method to remove redundant explanations provided by crowd workers. Explanations are clustered into groups, and the most representative ones are preserved. Park et al. [205] proposed C²A, which visualizes crowdsourced annotations and worker behavior to help doctors identify malignant tumors in clinical videos. Using C²A, doctors can discard most tumor-free video segments and focus on the ones that are most likely to contain tumors. To analyze the accuracy of crowdsourcing workers, Park et al. [204] developed CMed, which visualizes crowdsourced clinical image annotations and workers' behavior. By clustering workers according to their annotation accuracy and analyzing their logged events, experts are able to find good workers and observe the effects of workers' behavior patterns. LabelInspect [157] was proposed to improve crowdsourced labels by validating uncertain instance labels and unreliable workers. Three coordinated visualizations, a confusion (see Fig. 3(a)), an instance (see Fig. 3(b)), and a worker visualization (see Fig. 3(c)), were developed to facilitate the identification and validation of uncertain instance labels and unreliable workers. Based on expert validation, further instances and workers are recommended for validation by an iterative and progressive verification procedure.
Although the aforementioned methods can effectively improve crowdsourced labels, crowd information is not available in many real-world datasets. For example, the ImageNet dataset [215] only contains the cleaned labels produced by automatic noise removal methods. To tackle these datasets, Xiang et al. [275] developed DataDebugger to interactively improve data quality by utilizing user-selected trusted items. A hierarchical visualization combined with an incremental projection method and an outlier-biased sampling method facilitates the exploration and identification of trusted items. Based on these identified trusted items, a data correction algorithm propagates labels from trusted items to the whole dataset. Paiva et al. [202] assumed that instances misclassified by a trained classifier were likely to be mislabeled instances. Based on this assumption, they employed a Neighbor Joining Tree enhanced by multidimensional projections to help users explore misclassified instances and correct mislabeled ones. After correction, the classifier is refined using the corrected labels, and a new round of correction starts. Bäuerle et al. [14] developed three classifier-guided measures to detect data errors. Data errors are then presented in a matrix and a scatter plot, allowing experts to reason about and resolve errors.
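The assumption attributed to Paiva et al. [202] above, that instances misclassified by a classifier are likely to be mislabeled, can be sketched with a very small example. The code below uses a leave-one-out k-nearest-neighbour vote as a stand-in classifier; this choice, the function name, and the toy data are assumptions and do not reproduce the classifier or the Neighbor Joining Tree used in that work.

```python
import math


def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label disagrees with a leave-one-out k-NN vote."""
    suspects = []
    for i, p in enumerate(points):
        # Rank all other points by distance to p and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        votes = [labels[j] for j in neighbours]
        predicted = max(set(votes), key=votes.count)
        if predicted != labels[i]:
            suspects.append(i)
    return suspects


# Two well-separated clusters; index 3 sits in the first cluster
# but carries the label of the second one (deliberately mislabeled).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]
suspects = flag_suspect_labels(points, labels)
```

The flagged indices are only candidates: a misclassified instance may also be a hard but correctly labeled one, which is why the surveyed systems leave the final decision to the user.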
All the above methods start with a set of labeled data with noise. However, many datasets do not contain such a label set. To tackle this issue, many visual analytics methods have been proposed for interactive labeling. Reducing labeling effort is a major goal of interactive labeling. To this end, Moehrmann et al. [193] used an SOM-based visualization to place similar images together, allowing users to label multiple similar images of the same class in one go. This strategy is also used by Khayat et al. [125] to identify social spambot groups with similar anomalous behavior, Kurzhals et al. [136] to label mobile eye-tracking data, and Halter et al. [96] to annotate and analyze primary color strategies used in films. Apart from placing similar items together, other strategies, like filtering, have also been applied to find items of interest for labeling. Filtering and sorting are utilized in MediaTable [214] to find similar video segments. A table visualization is utilized to present video segments and their attributes. Users can filter out irrelevant segments and sort on attributes to order relevant segments, allowing users to label several segments of the same class simultaneously. Stein et al. [232] provided a rule-based filtering engine to find patterns of interest in soccer match videos. Experts can interactively specify rules through a natural language GUI.
Recently, to enhance the effectiveness of interactive labeling, various visual analytics methods have combined visualization techniques with machine learning techniques, such as active learning. The concept of 'intra-active labeling' was first introduced by Höferlin et al. [102]; it enhances active learning with human knowledge. Users are not only able to query instances and label them via active learning, but also to understand and steer machine learning models interactively. This concept is also used in text document retrieval [101], sequential data retrieval [144], trajectory classification [118], identifying relevant tweets [228], and argumentation mining [229]. For example, to annotate text fragments in argumentation mining tasks, Sperrle et al. [229] developed a language model for fragment recommendation. A layered visual abstraction is utilized to support five relevant analysis tasks required by text fragment annotation. In addition to developing systems for interactive labeling, some empirical experiments were conducted to demonstrate their effectiveness. For example, Bernard et al. [17] conducted experiments to show the superiority of user-centered visual interactive labeling over model-centered active learning. A quantitative analysis [18] was also performed to evaluate user strategies for selecting samples in the labeling process. Results show that in early phases, data-based (e.g. clusters and dense areas) user strategies work well. However, in later phases, model-based (e.g. class separation) user strategies perform better.
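The query step of active learning referred to above is often implemented as uncertainty sampling: the model asks for labels on the instances it is least sure about. The sketch below shows the common least-confidence variant; it is a generic illustration rather than the query strategy of any surveyed system, and the function name and the example probabilities are assumptions.

```python
def least_confident(probabilities, batch_size=2):
    """Pick the items whose top predicted class probability is lowest.

    `probabilities` maps an item id to the class probabilities produced by
    the current model (hypothetical layout). Returns `batch_size` item ids,
    most uncertain first.
    """
    ranked = sorted(probabilities, key=lambda item: max(probabilities[item]))
    return ranked[:batch_size]


# Toy model output for four unlabeled documents and two classes.
model_output = {
    "doc_a": [0.95, 0.05],
    "doc_b": [0.55, 0.45],
    "doc_c": [0.70, 0.30],
    "doc_d": [0.51, 0.49],
}
queries = least_confident(model_output)
```

A model-centered strategy like this one corresponds to the "model-based" selection discussed above; the data-based strategies (clusters, dense areas) would instead rank items by their position in the data distribution.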

Improving Feature Quality
A typical method to improve feature quality is selecting useful features that contribute most to the prediction, i.e. feature selection [44]. A common feature selection strategy is to select a subset of features that minimizes the redundancy between them and maximizes the relevance between them and targets (e.g. classes of instances) [184]. Along this line, several methods have been developed to interactively analyze the redundancy and relevance of features. For example, Seo et al. [223] proposed a rank-by-feature framework, which ranks features by relevance. They visualized ranking results with tables and matrices. Ingram et al. [109] proposed a visual analytics system, DimStiller, which allows users to explore features and their relationships and interactively remove irrelevant and redundant features. May et al. [184] proposed SmartStripes to select different feature subsets for different data subsets. A matrix-based layout is utilized to exhibit the relevance and redundancy of features. Mühlbacher et al. [195] developed a partition-based visualization for the analysis of the relevance of features or feature pairs. The features or feature pairs are partitioned into subdivisions, which allows users to explore the relevance of features (or feature pairs) at different levels of detail. A parallel coordinates visualization was utilized by Tam et al. [239] to identify features that could discriminate between different clusters. Krause et al. [132] ranked features across different feature selection algorithms, cross-validation folds, and classification models. Users are able to interactively select the features and models that lead to the best performance. Besides selecting existing features, constructing new features is also useful in model building. For example, FeatureInsight [30] was proposed to construct new features for text classification. By visually examining classifier errors and summarizing the root causes of these errors, users are able to create new features that can correctly discriminate misclassified documents. To improve the generalization capability of new features, visual summaries are used to analyze a set of errors instead of individual errors.
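The selection strategy described above, maximizing relevance to the target while minimizing redundancy among the chosen features, can be sketched as a greedy procedure. The version below scores relevance and redundancy with absolute Pearson correlation, which is an assumption made to keep the example dependency-free (mutual information is a common alternative); the function names and the toy data are likewise hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def greedy_select(features, target, n_select):
    """Greedily pick features with high relevance and low redundancy."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = abs(pearson(features[name], target))
        if not selected:
            return relevance
        # Redundancy: mean absolute correlation with already chosen features.
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


features = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x1_noisy": [1, 2, 3, 4, 6, 5],  # nearly a copy of x1, hence redundant
    "x2": [1, -1, 0, 0, -1, 1],      # uncorrelated with x1
}
target = [2, 1, 3, 4, 4, 7]          # equals x1 + x2 element-wise
selected = greedy_select(features, target, n_select=2)
```

Although `x1_noisy` is more relevant to the target than `x2` when judged in isolation, the redundancy penalty leads the procedure to prefer `x2` as the second feature, which is the kind of trade-off the interactive tools above expose to the user.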

Techniques during Model Building
Machine learning models are usually regarded as black boxes because of their lack of interpretability, which hinders their practical use in risky scenarios such as self-driving cars and financial investment. Current visual analytics techniques in model building explore how to reveal the underlying working mechanisms of machine learning models and then help model developers to build well-performing models.
First of all, model developers require a comprehensive understanding of models in order to free themselves from a time-consuming trial-and-error process. When the training process fails or the model does not provide satisfactory performance, model developers need to diagnose the issues occurring in the training process. Finally, there is a need to assist in model steering, as much time is spent in improving model performance during the model building process. Echoing these needs, researchers have developed many visual analytics methods to enhance model understanding, diagnosis, and steering [54,163].

Model Understanding
Works related to model understanding belong to two classes: those understanding the effects of parameters, and those understanding model behaviours.

Understanding the Effects of Parameters
One aspect of model understanding is to inspect how the model outputs change with changes in model parameters. For example, Ferreira et al. [79] developed BirdVis to explore the relationships between different parameter configurations and model outputs; these were bird occurrence predictions in their application. The tool also reveals how these parameters are related to each other in the prediction model. Zhang et al. [293] proposed a visual analytics method to visualize how variables affect statistical indicators in a logistic regression model.

Understanding Model Behaviours
Another aspect is how the model works to produce the desired outputs. There are three main types of methods used to explain model behaviours, namely network-centric, instance-centric, and hybrid methods. Network-centric methods aim to explore the model structure and interpret how different parts of the model (e.g. neurons or layers in convolutional neural networks) cooperate with each other to produce the final outputs. Earlier works employ directed graph layouts to visualize the structure of neural networks [245], but visual clutter becomes a serious problem as the model structure becomes increasingly complex. To tackle this problem, Liu et al. [155] developed CNNVis to visualize deep convolutional neural networks (see Fig. 4). It leverages clustering techniques to group neurons with similar roles as well as their connections in order to address visual clutter caused by their huge quantity. This tool helps experts understand the roles of the neurons and their learned features, and moreover, how low-level features are aggregated into high-level ones through the network. Later, Wongsuphasawat et al. [269] designed a graph visualization for exploring the machine learning model architecture in TensorFlow [1]. They conducted a series of graph transformations to compute a legible interactive graph layout from a given low-level dataflow graph to display the high-level structure of the model.
Instance-centric methods aim to provide instance-level analysis and exploration, as well as understanding the relationships between instances. Rauber et al. [210] visualized the representations learned from each layer in the neural network by projecting them onto 2D scatterplots. Users can identify clusters and confusion areas in the representation projections and, therefore, understand the representation space learned by the network. Furthermore, they can study how the representation space evolves during training so as to understand the network's learning behaviour. Some visual analytics techniques for understanding recurrent neural networks (RNNs) also adopt such an instance-centric design, for example LSTMVis [235].
Hybrid methods combine the above two methods and leverage both of their strengths. In particular, instance-level analysis can be enhanced with the context of the network architecture. Such contexts benefit the understanding of the network's working mechanism.
For instance, Hohman et al. [104] proposed Summit to reveal important neurons and critical neuron associations contributing to the model prediction. It integrates an embedding view to summarize the activations between classes and an attribution graph view to reveal influential connections between neurons. Kahng et al. [119] proposed ActiVis for large-scale deep neural networks. It visualizes the model structure with a computational graph and the activation relationships between instances, subsets, and classes using a projected view.
In recent years, there have been some efforts to use a surrogate explainable model to explain model behaviours. The major benefit of such methods is that they do not require users to investigate the model itself. Thus, they are more useful for those with no or limited machine learning knowledge. Treating the classifier as a black box, Ming et al. [189] first extracted rule-based knowledge from the input and output of the classifier. These rules are then visualized using RuleMatrix, which supports interactive exploration of the extracted rules by practitioners, improving the interpretability of the model. Wang et al. [254] developed DeepVID to generate a visual interpretation for image classifiers. Given an image of interest, a deep generative model was first used to generate samples near it. These generated samples were used to train a simpler and more interpretable model, such as a linear regression classifier, which helps explain how the original model makes the decision.
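The surrogate idea described above, sampling near an instance of interest and fitting a simpler model to the black-box outputs on those samples, can be sketched as follows. This is not DeepVID [254], which generates samples with a deep generative model; here the samples come from a small regular grid of perturbations, and `black_box`, the step size, and the instance are all hypothetical.

```python
import math
from itertools import product


def black_box(x):
    """Hypothetical opaque classifier: P(class = 1) for a 2-feature input."""
    return 1.0 / (1.0 + math.exp(-(4.0 * x[0] - 0.5 * x[1])))


def local_linear_surrogate(model, instance, step=0.1):
    """Fit a linear surrogate to `model` on a grid of samples around `instance`.

    The samples form a full factorial grid, so the feature columns are
    mutually uncorrelated and each least-squares weight reduces to
    cov(feature, output) / var(feature).
    """
    offsets = (-step, 0.0, step)
    samples = [
        tuple(v + d for v, d in zip(instance, deltas))
        for deltas in product(offsets, repeat=len(instance))
    ]
    outputs = [model(s) for s in samples]
    mean_y = sum(outputs) / len(outputs)

    weights = []
    for j in range(len(instance)):
        column = [s[j] for s in samples]
        mean_x = sum(column) / len(column)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, outputs))
        var = sum((x - mean_x) ** 2 for x in column)
        weights.append(cov / var)
    return weights


weights = local_linear_surrogate(black_box, (0.2, 0.4))
```

The sign and magnitude of each weight indicate how the prediction responds to that feature near the chosen instance; the explanation is local and may not hold elsewhere in the input space.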

Model Diagnosis
Visual analytics techniques for model diagnosis may either analyze the training results or analyze the training dynamics.

Analyzing Training Results
Tools have been developed for diagnosing classifiers based on their performance [7,19,86,211]. For example, Squares [211] used boxes to represent samples and group them according to their prediction classes. Using different textures to encode true/false positives/negatives, this tool allows fast and accurate estimation of performance metrics at multiple levels of detail. Recently, the issue of model fairness has drawn growing attention [2,32,267]. For example, Ahn et al. [2] proposed a framework named FairSight and implemented a visual analytics system to support the analysis of fairness in ranking problems. They divided the machine learning pipeline into three phases (data, model, and outcome) and then measured the bias at both the individual and group levels using different measures. Based on these measures, developers can iteratively identify those features that cause discrimination and remove them from the model. Researchers are also interested in exploring potential vulnerabilities in models that prevent them from being reliably applied to real-world applications [33,178]. Cao et al. [33] proposed AEVis to analyze how adversarial examples fool neural networks. The system (see Fig. 5) takes both normal and adversarial examples as input and extracts their datapaths for model prediction. It then employs a river-based metaphor to show the diverging and merging patterns of the extracted datapaths, which reveal where the adversarial samples mislead the model. Ma et al. [178] designed a series of visual representations from overview to detail to reveal how data poisoning will make a model misclassify a specific sample. By comparing the distributions of the poisoned and normal training data, experts can deduce the reason for the misclassification of the attacked sample.
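Performance-based diagnosis tools such as those above ultimately rest on per-class counts of correct and incorrect predictions. The sketch below computes these counts from label lists; it only illustrates the underlying quantities and makes no claim about how Squares [211] is implemented. The function name and the example labels are assumptions.

```python
def per_class_counts(true_labels, predicted_labels):
    """Count true positives, false positives, and false negatives per class."""
    pairs = list(zip(true_labels, predicted_labels))
    counts = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        counts[c] = {
            "tp": sum(t == c and p == c for t, p in pairs),
            "fp": sum(t != c and p == c for t, p in pairs),
            "fn": sum(t == c and p != c for t, p in pairs),
        }
    return counts


# Toy three-class example with two misclassifications.
true_labels = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted_labels = ["cat", "dog", "dog", "dog", "bird", "cat"]
counts = per_class_counts(true_labels, predicted_labels)
```

Per-class precision and recall follow directly as tp / (tp + fp) and tp / (tp + fn), which are the kinds of metrics a sample-level visualization lets users estimate at a glance.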

Analyzing Training Dynamics
Recent efforts have also been concentrated on analyzing the training dynamics. These techniques are intended for debugging the training process of machine learning models. For example, DGMTracker [154] helps experts discover the reasons for a failed training process of deep generative models. It utilizes a blue-noise polyline sampling algorithm to simultaneously keep the outliers and the major distribution of the training dynamics in order to help experts detect the potential root cause of a failure. It also employs a credit assignment algorithm to disclose the interactions between neurons to facilitate the diagnosis of failure propagation. Attention has also been given to the diagnosis of the training process of deep reinforcement learning. Wang et al. [252] proposed DQNViz for the understanding and diagnosis of deep Q-networks for a Breakout game. At the overview level, DQNViz presents changes in the overall statistics during the training process with line charts and stacked area charts. Then at the detail level, it uses segment clustering and a pattern mining algorithm to help experts identify common as well as suspicious patterns in the event-sequences of the agents in Q-networks. As another example, He et al. [98] proposed DynamicsExplorer to diagnose an LSTM trained to control a ball-in-maze game. To support quick identification of where training failures arise, it visualizes ball trajectories with a trajectory variability plot, as well as their clusters using a parallel coordinates plot.

Model Steering
There are two major strategies for model steering: refining the model with human knowledge and selecting the best model from a model ensemble.

Model Refinement with Human Knowledge
Several visual analytics techniques have been developed to place users into the loop of the model refinement process, through flexible interaction.
Users can directly refine the target model with visual analytics techniques. A typical example is ProtoSteer [190], a visual analytics system that enables editing prototypes to refine a prototype sequence network named ProSeNet [191]. ProtoSteer uses four coordinated views to present the information about the learned prototypes in ProSeNet. Users can refine these prototypes by adding, deleting, and revising specific prototypes. The model is then retrained with these user-specific prototypes for performance gain. In addition, van der Elzen et al. [246] proposed BaobabView to support experts to construct decision trees iteratively using domain knowledge. Experts can refine the decision tree with direct operations, including growing, pruning, and optimizing the internal nodes, and can evaluate the refined one with various visual representations.
Besides direct model updates, users can also correct flaws in the results or provide extra knowledge, allowing the model to be updated implicitly to produce improved results based on human feedback. Several works have focused on incorporating user knowledge into topic models to improve their results [53,69,73,127,262,283]. For instance, Yang et al. [283] presented ReVision, which allows users to steer hierarchical clustering results by leveraging an evolutionary Bayesian rose tree clustering algorithm with constraints. As shown in Fig. 6, the constraints and the clustering results are displayed with an uncertainty-aware tree-based visualization to guide the steering of the clustering results. Users can refine the constraint hierarchy by dragging. Documents are then re-clustered based on the modified constraints. Other human-in-the-loop models have also stimulated the development of visual analytics systems to support such kinds of model refinement. For instance, Liu et al. [153] proposed MutualRanker using an uncertainty-based mutual reinforcement graph model to retrieve important blogs, users, and hashtags from microblog data. It shows ranking results, uncertainty, and its propagation with the help of a composite visualization; users can examine the most uncertain items in the graph and adjust their ranking scores. The model is incrementally updated by propagating adjustments throughout the graph.

Model Selection from an Ensemble
Another strategy for model steering is to select the best model from a model ensemble, which is usually found in clustering [41,201,221] and regression models [23,60,170,208]. Clustrophile 2 [41] is a visual analytics system for visual clustering analysis, which guides user selection of appropriate input features and clustering parameters through recommendations based on user-selected results. BEAMES [60] was designed for multimodel steering and selection in regression tasks. It creates a collection of regression models by varying algorithms and their corresponding hyperparameters, with further optimization by interactive weighting of data instances and interactive feature selection and weighting. Users can inspect them and then select an optimal model according to different aspects of performance, such as their residual scores and mean squared errors.
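The idea of building a collection of candidate regression models and ranking them by error can be sketched in a few lines. This is a simplified illustration rather than the BEAMES pipeline: it assumes a one-dimensional ridge regression without intercept, varies only the regularization strength, and uses the hypothetical helper names `fit_ridge_1d`, `mse`, and `rank_candidates`.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge fit without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def rank_candidates(train, valid, lambdas):
    """Fit one model per hyperparameter value and sort by validation error."""
    train_x, train_y = train
    valid_x, valid_y = valid
    results = []
    for lam in lambdas:
        w = fit_ridge_1d(train_x, train_y, lam)
        results.append({"lambda": lam, "w": w, "mse": mse(w, valid_x, valid_y)})
    return sorted(results, key=lambda r: r["mse"])
```

A ranked list like this is only the starting point for a visual interface, which would additionally expose residuals per instance so that users can judge candidates on more than one aggregate score.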

Techniques after Model Building
Existing visual analytics efforts after model building aim to help users understand and gain insights from model outputs, such as high-dimensional data analysis results [158,162].
As these methods are often data-driven, we categorize the corresponding methods according to the type of data analyzed. The temporal property of data is critical in visual design. Thus, we classify methods as those understanding static data analysis results, and those understanding dynamic data analysis results. A visual analytics system for understanding static data analysis results usually treats all model output as a large collection and analyzes the static structure. For dynamic data, in addition to understanding the analysis results at each time point, the system focuses on illustrating the evolution of data over time, which is learned by the analysis model.

Understanding Static Data Analysis Results
We summarize the research on understanding static data analysis according to the type of data. Most research focuses on textual data analysis, while fewer works study the understanding of other types of data analysis.

Textual Data Analysis
The most widely studied topic is visual text analytics, which tightly integrates interactive visualization techniques with text mining techniques (e.g. document clustering, topic models, and word embedding) to help users better understand a large amount of textual data [162].
Some early works employed simple visualizations to directly convey the results of classical text mining techniques, such as text summarization, categorization, and clustering.
For example, Görg et al. [89] developed a multi-view visualization consisting of a list view, a cluster view, a word cloud, a grid view, and a document view, to visually illustrate analysis results of document summarization, document clustering, sentiment analysis, entity identification, and recommendation.
By combining interactive visualization with text mining techniques, a smooth and informative exploration environment is provided to users.
Most later research has focused on combining well-designed interactive visualization with state-of-the-art text mining techniques, such as topic models and deep learning models, to provide deeper insights into textual data. To provide an overview of the relevant topics discussed in multiple sources, Liu et al. [161] first utilized a correlated topic model to extract topic graphs from multiple text sources. A graph matching algorithm is then developed to match the topic graphs from different sources, and a hierarchical clustering method is employed to generate hierarchies of topic graphs. Both the matched topic graph and hierarchies are fed into a hybrid visualization which consists of a radial icicle plot and a density-based node-link diagram (see Fig. 7(a)), to support exploration and analysis of common and distinctive topics discussed in multiple sources. Dou et al. [66] introduced DemographicVis to analyze different demographic groups on social media based on the content generated by users. An advanced topic model, latent Dirichlet allocation (LDA) [179], is employed to extract topic features from the corpus. Relationships between the demographic information and extracted features are explored through a parallel sets visualization [130], and different demographic groups are projected onto a two-dimensional space based on the similarity of their topics of interest (see Fig. 7(b)). Recently, some deep learning models have also been adopted because of their better performance. For example, Berger et al. [15] proposed cite2vec to visualize the latent themes in a document collection via document usage (e.g. citations). It extended the skip-gram model [187], a well-known word2vec model, to generate the embedding for both words and documents by considering the citation information and the textual content together. The words are first projected into a two-dimensional space using t-SNE, and the documents are then projected onto the same space, where both the document-word relationships and document-document relationships are considered simultaneously.

Other Data Analysis
In addition to textual data, other types of data have also been studied.
For example, Hong et al. [105] analyzed flow fields through an LDA model by defining pathlines as documents and features as words, respectively.
After modeling, the original pathlines and extracted topics were projected into a two-dimensional space using multidimensional scaling, and several previews were generated to render the pathlines for important topics. Recently, a visual analytics tool, SMARTexplore [22], was developed to help analysts find and understand interesting patterns within and between dimensions, including correlations, clusters, and outliers. To this end, it tightly couples a table-based visualization with pattern matching and subspace analysis.

Understanding Dynamic Data Analysis Results
In addition to understanding the results of static data analysis, it is also important to investigate and analyze how latent themes in data change over time. For example, a system can help politicians to make timely decisions if it provides an overview of major public opinions on social media and how they change over time. Most existing works focus on understanding the analysis results of a data corpus where each data item is associated with a time stamp. According to whether the system supports the analysis of streaming data, we may further classify existing works on visual dynamic data analysis as offline and online. In offline analysis, all data are available before analysis, while online analysis tackles streaming data that is incoming during the analysis process.

Offline Analysis
Offline analysis research can be classified according to the analysis task: topic analysis, event analysis, and trajectory analysis.
Understanding topic evolution in a large text corpus over time is an important topic, attracting much attention. Most existing works adopt a river metaphor to convey changes in the text corpus over time. ThemeRiver [97] is one of the pioneering works, using the river metaphor to reveal changes in the volumes of different themes. To better understand the content change of a document corpus, TIARA [166,265] utilizes an LDA model [21] to extract topics from the corpus and reveal their changes over time. However, only observing volumes and content change is not enough for complex analysis tasks where users want to explore relationships between different topics and their changes over time. Therefore, later works have focused on understanding relationships between topics (e.g. topic splitting and merging) and their evolving patterns over time. For example, Cui et al. [58] first extracted topic splitting and merging patterns from a document collection using an incremental hierarchical Dirichlet process model [240]. Then a river metaphor with a set of well-designed glyphs was developed to visually illustrate the aforementioned topic relationships and their dynamic changes over time. Xu et al. [281] leveraged a topic competition model to extract dynamic competition between topics and the effects of opinion leaders on social media. Sun et al. [237] extended the competition model to a 'coopetition' (cooperation and competition) model to help understand the more complex interactions between evolving topics. Wang et al. [261] proposed IdeaFlow, a visual analytics system for learning the lead-lag relationships across different social groups over time. However, these works use a flat structure to model topics, which hampers their usage in the era of big data for handling large-scale text corpora. Fortunately, there are already initial efforts in coupling hierarchical topic models with interactive visualization to favor the understanding of the main content in a large text corpus. 
For example, Cui et al. [59] extracted a sequence of topic trees using an evolutionary Bayesian rose tree algorithm [263] and then calculated the tree cut for each tree. These tree cuts are used to approximate the topic trees and display them in a river metaphor, which also reveals dynamic relationships between the topics, including topic birth, death, splitting, and merging.
Event analysis targets revealing common or semantically important sequential patterns in ordered sequences of events [94,112,169,177]. To facilitate visual exploration of large-scale event sequences and pattern discovery, several visual analytics methods have been proposed. For example, Liu et al. [169] developed a visual analytics method for click stream data. Maximal sequential patterns are discovered and pruned from the click stream data. The extracted patterns and original data are illustrated at four granularities: patterns, segments, sequences, and events. Guo et al. [94] developed EventThread, which uses a tensor-based model to transform the event sequence data into an n-dimensional tensor. Latent patterns (threads) are extracted with a tensor decomposition technique, segmented into stages, and then clustered. These threads are represented as segmented linear stripes, and a line map metaphor is used to reveal the changes between different stages. Later, EventThread was extended to overcome the limitation of the fixed length of each stage [93]. The authors proposed an unsupervised stage analysis algorithm to effectively identify the latent stages in event sequences. Based on this algorithm, an interactive visualization tool was developed to reveal and analyze the evolution patterns across stages.
Other works focus on understanding movement data (e.g. GPS records) analysis results. Andrienko et al. [10] extracted movement events from trajectories and then performed spatio-temporal clustering for aggregation. These clusters are visualized using spatiotemporal envelopes to help analysts find potential traffic jams in the city. Chu et al. [55] adopted an LDA model for mining latent movement patterns in taxi trajectories. The movement of each taxi, represented by the traversed street names, was regarded as a document. Parallel coordinates were used to visualize the distribution of streets over topics, where each axis represents a topic, and each polyline represents a street. The evolution of the topics was visualized as topic routes that connect similar topics between adjacent time windows.
More recently, Zhou et al. [300] treated origin-destination flows as words and trajectories as paragraphs. A word2vec model was then used to generate the vectorized representation for each origin-destination flow. t-SNE was then employed to project the embedding of the flows into two-dimensional space, where analysts can check the distributions of the origin-destination flows and select some for display on the map. Besides directly analyzing the original trajectory data, other papers try to augment the trajectories with auxiliary information to reduce the burden on visual exploration. Kruger et al. [135] clustered destinations with DBSCAN and then used Foursquare to provide detailed information about the destinations (e.g. shops, university, residence). Based on the enriched data, frequent patterns were extracted and displayed in the visualization (see Fig. 9); icons on the time axis help understand these patterns. Chen et al. [50] mined trajectories from geo-tagged social media and displayed keywords extracted from the text content, helping users explore the semantics of trajectories.

Online Analysis
Online analysis is especially necessary for streaming data, such as text streams. As a pioneering work for online analysis of text streams, Thom et al. [241] proposed ScatterBlogs to analyze geo-located tweet streams. The system uses Twitter4J to get streaming tweets and extracts location, time, user ID, and tokenized terms in the tweets. To efficiently analyze a tweet stream, an incremental clustering algorithm was employed to cluster similar tweets. Based on the clustering results, spatio-temporal anomalies were detected and reported to users in real time.
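To make the notion of incremental clustering over a stream concrete, the sketch below assigns each arriving item to the most similar existing cluster or opens a new one. It is not the algorithm used in the cited system, whose details are not given here; the Jaccard similarity on term sets, the threshold value, and the class name `IncrementalClusterer` are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalClusterer:
    """Single-pass clustering: each item is processed once, in arrival order."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.clusters = []  # each cluster: {"terms": set, "members": list}

    def add(self, item_id, terms):
        """Assign the item to a cluster and return that cluster's index."""
        terms = set(terms)
        best_idx, best_sim = None, 0.0
        for idx, cluster in enumerate(self.clusters):
            sim = jaccard(terms, cluster["terms"])
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is not None and best_sim >= self.threshold:
            self.clusters[best_idx]["members"].append(item_id)
            self.clusters[best_idx]["terms"] |= terms
            return best_idx
        # No sufficiently similar cluster: start a new one.
        self.clusters.append({"terms": terms, "members": [item_id]})
        return len(self.clusters) - 1
```

Because earlier assignments are never revisited, the cost per item depends only on the number of clusters, which is what makes this family of methods attractive for streams.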
To reduce user effort for filtering and monitoring in ScatterBlogs, Bosch et al. [26] proposed ScatterBlogs2, which enhanced ScatterBlogs with machine learning techniques. In particular, an SVM-based classifier was built for filtering tweets of interest, and an LDA model was employed to generate a topic overview.
To efficiently handle high-volume text streams, Liu et al. [165] developed TopicStream to help users analyze hierarchical topic evolution in such streams. In TopicStream, an evolutionary topic tree was built from text streams, and a tree cut algorithm was developed to reduce visual clutter and enable users to focus on topics of interest. Combining a river metaphor and a visual sedimentation metaphor, the tool effectively illustrates the overall hierarchical topic evolution as well as how newly arriving textual documents are gradually aggregated into the existing topics over time. Inspired by TopicStream, Wu et al. [272] developed StreamExplorer, which enables the tracking and comparison of a social stream. In particular, an entropy-based event detection method was developed to detect events in the social media stream. They are further visualized in a multi-level visualization, including a glyph-based timeline, a map visualization, and interactive lenses. In addition to text streams, other types of streaming data are also analyzed. For example, Lee et al. [140] employed a long short-term memory model for road traffic congestion forecasting and visualized the results with a Volume-Speed Rivers visualization. Propagation of congestion was also extracted and visualized, helping analysts understand causality within the detected congestion.

Research Opportunities
Although visual analytics research for machine learning has achieved promising results in both academia and real-world applications, there are still several long-term research challenges. Here, we discuss and highlight major challenges and potential research opportunities in this area.

Opportunities before Model Building

Improving Data Quality for Weakly Supervised Learning
Weakly supervised learning builds models from data with quality issues, including inaccurate labels, incomplete labels, and inexact labels. Improving data quality can boost the performance of weakly supervised learning models [148]. Most existing methods focus on quality issues of inaccurate data (e.g. noisy crowdsourced annotations and label errors), and on interactive labeling for incomplete data (e.g. none or only a few data are labeled). However, fewer efforts are devoted to better exploiting unlabeled data, which also relates to the incomplete data issue, or to inexact data (e.g. coarse-grained labels that are not as exact as required). This leaves room for future research.
Firstly, the potential for visual analytics techniques to address the incompleteness issue is not fully exploited.
For example, improving the quality of unlabeled data is critical for semi-supervised learning [148,149], in which unlabeled data is tightly combined with a small amount of labeled data during training to infer the correct mapping from the data set to the label set. One typical example is graph-based semi-supervised learning [149], which depends on the relationship between labeled and unlabeled data. Automatically constructed relationships (graphs) are sometimes poor in quality, resulting in model performance degradation. A major cause behind these poor-quality graphs is that automatic graph construction methods usually rely on global parameters (e.g. a global k value in the kNN graph construction method), which may be locally inappropriate. As a consequence, it is necessary to utilize visualization to illustrate how labels are propagated along graph edges, to facilitate understanding of how local graph structures affect model performance. Based on such understanding, experts can adaptively modify the graph to gradually create a higher-quality graph.
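The dependence of label propagation on a single global k can be illustrated with a small sketch. It assumes a symmetric kNN graph under Euclidean distance and a simple synchronous majority-vote propagation rule, which is a simplification of the graph-based methods cited above; `knn_graph` and `propagate_labels` are hypothetical names. The returned trace records which nodes changed label in each round, which is the kind of information a visualization of propagation would need.

```python
import math


def knn_graph(points, k):
    """Build a symmetric kNN graph using one global k for every point."""
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def propagate_labels(neighbors, seeds, n_iter=20):
    """Synchronous majority-vote propagation; seed labels stay fixed.

    Returns the final labels and a per-round trace of label changes.
    """
    labels = dict(seeds)
    trace = []
    for _ in range(n_iter):
        updates = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                continue
            votes = {}
            for nb in nbrs:
                if nb in labels:
                    votes[labels[nb]] = votes.get(labels[nb], 0) + 1
            if votes:
                # Ties are broken by label order to keep the result deterministic.
                best = max(sorted(votes), key=votes.get)
                if labels.get(node) != best:
                    updates[node] = best
        if not updates:
            break
        labels.update(updates)
        trace.append(updates)
    return labels, trace
```

Rebuilding the graph with a larger k on the same points adds edges between distant groups, so comparing the traces for different k values is one simple way to see where a global setting is locally inappropriate.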
Secondly, although the inexact data quality issue is common in real-world applications [303], it has received little attention from the field of visual analytics. This issue refers to the situation where labels are inexact, e.g. coarse-grained labels, such as arise in computed tomography (CT) scans. The labels of CT scans usually come from corresponding diagnosis reports that describe whether patients have certain medical problems (e.g. a tumor). For a CT scan with tumors, we only know that one or more slices in the scan contain tumors. However, we know neither which slices contain tumors nor the exact tumor locations in these slices. Although various machine learning methods [82,302] have been proposed to learn from such coarse-grained labels, they may lead to poor performance [148] due to the lack of exact information. Fine-grained validation is still required to improve data quality. To this end, one potential solution is to combine interactive visualization with learning algorithms to better illustrate the root cause of bad performance by examining the overall data distribution and the wrong predictions, and to develop an interactive verification process for providing more fine-grained labels while minimizing expert effort.

Explainable Feature Engineering
Most existing works for improving feature quality focus on tabular or textual data from traditional analysis models.
The features of these data are naturally interpretable, which makes the feature engineering process simple.
In addition, features extracted by deep neural networks perform better than handcrafted ones [65,256]. However, these deep features are hard to interpret due to the black box nature of deep neural networks, which brings several challenges for feature engineering.
Firstly, the extracted features are obtained in a data-driven process, which may poorly represent the original images/videos when the datasets are biased. For example, given a dataset with only dark dogs and light cats, the extracted features may emphasize color and ignore other discriminating concepts, like shapes of faces and ears. Without a clear understanding of these biased features, it is hard to correct them in a comprehensive way. Thus, an interesting topic for future work is to utilize interactive visualization to disclose why the features are biased. The key challenge here is how to measure the information preserved or discarded by the extracted features and to visualize it in a comprehensible manner.
Moreover, redundancy exists in extracted deep features [12]. Removing redundant features can lead to several benefits, such as reducing storage requirements and improving generalization [44]. However, without a clear understanding of the exact meaning of features, it is hard to judge whether a feature is redundant. Thus, an interesting future topic is to develop a visual analytics method to convey feature redundancy in a comprehensible way, which allows experts to explore it and remove redundant features.
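One simple, admittedly coarse proxy for feature redundancy is a high absolute correlation between the activations of two features across samples. The sketch below uses that proxy; it is an assumption for illustration and not a method proposed in the cited works, and the names `pearson` and `redundant_pairs` as well as the threshold are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def redundant_pairs(features, threshold=0.95):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold.

    `features` maps a feature name to its activations over the same samples.
    """
    pairs = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs
```

Linear correlation misses nonlinear dependence and says nothing about what a feature means, which is exactly the gap that a visual, expert-driven exploration of redundancy would need to fill.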

Online Training Diagnosis
Existing visual analytics tools for model diagnosis mostly work offline: the data for diagnosis is collected after the training process is finished. They have shown their capability for revealing the root causes of failed training processes. However, as modern machine learning models become more and more complex, training processes can last for days or even weeks. Offline diagnosis severely restricts the ability of visual analytics to assist in training. Thus, there is a significant need to develop visual analytics tools for online diagnosis of the training process so that model developers can identify anomalies and promptly make corresponding adjustments to the process. This can save much time in the trial-and-error model building process.
The key challenge for online diagnosis is to detect anomalies in the training process in a timely manner.
While it remains a difficult task to develop algorithms for automatically and accurately detecting anomalies in real-time, interactive visualization promises a way to locate potential errors in the training process. Differing from offline diagnosis, the data of the training process will be continuously fed into the online analysis tool. Thus, progressive visualization techniques are needed to produce meaningful visualization results of partial streaming data. These techniques can help experts monitor online model training processes and identify possible issues rapidly.
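As a baseline for what timely detection on streaming training data might look like, the sketch below flags loss values that deviate strongly from a running mean, using Welford's online update for mean and variance. It is a deliberately naive illustration: the z-score rule, the thresholds, and the class name `LossMonitor` are assumptions, and a global running mean is a poor model of a steadily decreasing loss curve.

```python
import math


class LossMonitor:
    """Flag anomalous loss values as they arrive, one at a time."""

    def __init__(self, z_threshold=4.0, warmup=10):
        self.z_threshold = z_threshold
        self.warmup = warmup  # number of values to observe before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, loss):
        """Return True if the value is flagged as anomalous."""
        if math.isnan(loss) or math.isinf(loss):
            return True
        flagged = False
        if self.n >= self.warmup and self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(loss - self.mean) / std > self.z_threshold:
                flagged = True
        if not flagged:
            # Flagged values are kept out of the running statistics.
            self.n += 1
            delta = loss - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (loss - self.mean)
        return flagged
```

The per-step cost is constant, so such a check can run alongside training; the flagged steps would then serve as entry points for a progressive visualization rather than as a final diagnosis.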

Interactive Model Refinement
Recent works have explored the utilization of uncertainty to facilitate interactive model refinement [73,153,262,283].
There are many methods to assign uncertainty scores to model outputs (e.g. based on confidence scores produced by classifiers), and visual hints can be used to guide users to examine model outputs with high uncertainty. Model uncertainty will be recomputed after user refinement, and users can refine the model iteratively until they are satisfied with the results. Furthermore, additional information can also be leveraged to provide users with more intelligent guidance to facilitate a fast and accurate model refinement process. However, the room for improving interactive model refinement is still largely unexplored by researchers. One possible direction is that since the refinement process usually requires several iterations, guidance in later iterations can be learned from users' previous interactions. For example, in a clustering application, users may define some must-link or cannot-link constraints on some instance pairs, and such constraints can be used to instruct a model to split or merge some clusters in the intermediate result. In addition, prior knowledge can be used to predict where refinements are needed.
For example, model outputs may conflict with certain public or domain knowledge, especially for unsupervised models (e.g. nonlinear matrix factorization and latent Dirichlet allocation for topic modeling), which should be considered in the refinement process. Therefore, such a knowledge-based strategy focuses on revealing unreasonable results produced by the models, allowing users to refine the models by adding constraints to them.
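One common way to turn classifier confidence scores into an uncertainty score is the Shannon entropy of the predicted class distribution. The sketch below ranks model outputs by that score so that the most uncertain ones can be surfaced first; the choice of entropy and the helper names `entropy` and `most_uncertain` are assumptions for illustration, not the scoring used by any of the cited systems.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, top_k=3):
    """Return the ids of the top_k outputs with the highest entropy.

    `predictions` maps an item id to its list of class probabilities.
    """
    ranked = sorted(
        predictions.items(), key=lambda kv: entropy(kv[1]), reverse=True
    )
    return [item_id for item_id, _ in ranked[:top_k]]
```

After each round of user refinement the probabilities would be recomputed and the ranking refreshed, matching the iterative loop described above.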

Understanding Multi-modal Data
Existing works on content analysis have achieved great success in understanding single-modal data, such as texts, images, and videos. However, real-world applications often contain multi-modal data, which combines several different content forms, such as text, audio, and images.
For example, a physician diagnoses a patient after considering multiple kinds of data, such as the medical record (texts), laboratory reports (tables), and CT scans (images). When analyzing such multi-modal data, in-depth relationships between different modalities cannot be well captured by simply combining knowledge learned from single-modal models. It is more promising to employ multi-modal machine learning techniques and leverage their capability to disclose insights across different forms of data. To this end, a more powerful visual analytics system is crucial for understanding the output of such multi-modal learning models. Many machine learning models have been proposed to learn joint representations of multi-modal data, including natural language, visual signals, and vocal signals [13,171]. Accordingly, an interesting future direction is how to effectively visualize learned joint representations of multi-modal data in an all-in-one manner, to facilitate the understanding of the data and their relationships. Various classic multi-modal tasks can be employed to enhance natural interactions in the field of visual analytics. For example, in the vision-and-language scenario, the visual grounding task (identifying the corresponding image area given the description) can be used to provide a natural interface to support natural-language-based image retrieval in a visual environment.

Analyzing Concept Drifts
In real-world applications, it is often assumed that the mapping from input data to output values (e.g. prediction label) is static. However, as data continues to arrive, the mapping between the input data and output values may change in unexpected ways [172]. In such a situation, a model trained on historical data may no longer work properly on new data. This usually causes noticeable performance degradation when the application data does not match the training data. Such a non-stationary learning problem over time is known as concept drift. As more and more machine learning applications directly consume streaming data, it is important to detect and analyze concept drift and minimize the resulting performance degradation [258,282]. In the field of machine learning, three main research topics have been studied: drift detection, drift understanding, and drift adaptation. Machine learning researchers have proposed many automatic algorithms to detect and adapt to concept drift. Although these algorithms can improve the adaptability of learning models in an uncertain environment, they only provide a numerical value to measure the degree of drift at a given time. This makes it hard to understand why and where drift occurs. If the adaptation algorithms fail to improve the model performance, the black-box behavior of the adaptation models makes it difficult to diagnose the root cause of performance degradation. As a result, model developers need tools that intuitively illustrate how data distributions have changed over time, which samples cause drift, and how the training samples and models can be adjusted to overcome such drift. This requirement naturally leads to a visual analytics paradigm where the expert interacts and collaborates in concept drift detection and adaptation algorithms by putting the human in the loop.
The major challenges here are how to (i) visually represent the evolution patterns of streaming data over time and effectively compare data distributions at different points in time, and (ii) tightly integrate such streaming data visualization with drift detection and adaptation algorithms to form an interactive and progressive analysis environment with the human in the loop.
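The single numerical drift value mentioned above can be illustrated by comparing the distribution of a feature in a reference window with that in later windows. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the maximum gap between two empirical distribution functions) as the drift score; this is one possible choice among many rather than the detector of any cited work, and `ks_statistic` and `drift_scores` are hypothetical names.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def drift_scores(values, window):
    """Score each later non-overlapping window against the first window."""
    reference = values[:window]
    scores = []
    for start in range(window, len(values) - window + 1, window):
        current = values[start:start + window]
        scores.append((start, ks_statistic(reference, current)))
    return scores
```

A score of this kind says that the distribution changed but not which samples or regions are responsible, which is the limitation that motivates the visual comparison of distributions across time points.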

Conclusions
This paper has comprehensively reviewed recent progress and developments in visual analytics techniques for machine learning. These techniques are classified into three groups by the corresponding analysis stage: techniques before, during, and after model building. Each category is detailed by typical analysis tasks, and each task is illustrated by a set of representative works. By comprehensively analyzing existing visual analytics research for machine learning, we also suggest six directions for future machine-learning-related visual analytics research, including improving data quality for weakly supervised learning and explainable feature engineering before model building, online training diagnosis and intelligent model refinement during model building, and multi-modal data understanding and concept drift analysis after model building. We hope this survey has provided an overview of visual analytics research for machine learning, facilitating understanding of state-of-the-art knowledge in this area, and shedding light on future research.