Monitoring Large Scale Production Processes Using a Rule-Based Visualization Recommendation System

Data visualization plays an important role in analyzing data and identifying insights and characteristics within a dataset. Current visualization tools are limited in the number of dimensions they can display simultaneously. This often makes it impossible to highlight hidden patterns or trends, affecting analysts' decision-making capabilities. In this scenario, it becomes imperative to propose systems that recommend useful insights and guide users to understand and explore data more efficiently. Addressing this problem, we propose a rule-based visualization recommendation (VizRec) system that pursues the research goal of an intelligent assistant guiding users directly to relevant insights in the data. Our proposed VizRec system automates aspects of visualization design and recommends visualizations using a knowledge-based rule engine, incorporating both the data characteristics and the diverse tasks representing users' goals and intents. To ensure the correctness of the knowledge-based rules and the dynamic nature of the rule engine, a formalization theory is proposed. We implemented our proposed model as a working tool for the exploration of complex production data generated by the manufacturing processes of the German cluster of excellence "Internet of Production". Our tool generates recommendations that visualize data insights regardless of the domain it is used in. Existing systems face several shortcomings, ranging from the number of dimensions they can handle to the limited number of supported visualizations. The visualizations recommended by our proposed system mitigate such challenges by adopting a generic rule-based approach that incorporates visualizations handling multidimensional data. We evaluated our system in multiple fields, including the engineering domain, where production systems datasets are used to achieve user-intended tasks and gain valuable insights.


Introduction
In the current era of Big Data, the evolution of exploratory data analysis, i.e. the process of extracting patterns and inferring knowledge from raw data, has been driven by an increased ability to collect and generate a wide variety of complex and high-dimensional datasets. One proven method to efficiently communicate, comprehend, and interact with this large amount of information is data visualization. However, the increasing dimensionality and growing volume of data make it challenging for analysts to visualize multidimensional data and unfold the hidden information [1]. Though sophisticated tools and techniques for visualizing data exist, domain knowledge and extensive manual effort are needed to generate meaningful visualizations that are effective for a particular set of tasks. As an aid to these challenges, visualization recommendation (VizRec) tools have been developed to help users explore data more efficiently even without specific domain knowledge. However, the VizRec ecosystem is still in its infancy [2]. Beyond the "Show Me" feature in Tableau and the "Explore" function in Google Sheets, VizRec systems are unheard of by most analysts. Some current literature proposes tools in this direction [3][4][5]. However, most of these applications suffer from various limitations, especially in handling high-dimensional data. These systems are either single-task systems supporting a specific analytical task or multi-task recommenders based on simple statistical metrics for rendering "interesting" visualizations to the end-users.
In this paper, we highlight such challenges faced in the domain of manufacturing processes. With the advent of Industry 4.0, such processes are undergoing digital transformations under data-driven production environments. This has resulted in exponential growth of the data that is collected, stored and analyzed to gain valuable insights. The increase in the dimensionality of data also creates various challenges for data visualization [6]. Visualizations based on more than three data dimensions require efficient ways to display the provided data such that the visualization is easy to comprehend and the data does not lose its meaning. This is mainly because human cognition limits the number of data dimensions that can be visually interpreted. The potentially large number of overlapping data points projected onto a two-dimensional display hinders the interpretation of meaningful patterns in the data.
For example, in the case of high-dimensional production data with hundreds of attributes, an analyst willing to explore the data for meaningful insights often faces the challenge of determining the starting point of the analysis. Once that is determined, the next challenge is to generate multiple individual visualizations for different parts of the dataset. This leads to the final challenge, where the analyst has to filter out only the effective visualizations. This process is not only time-consuming but also computationally expensive, and most of the time it requires the analyst's expert domain knowledge. To summarize these challenges, before starting any visual exploration task, the analyst has to answer the following questions:

• Which type of visualization is best suited for the given tasks?
• Which attributes in the dataset should be selected for the visualizations?
• What type of data transformations should be performed to generate the appropriate visualizations?
• What are the optimal visual encodings for mapping the data to its corresponding visual structures?
The answers to these questions are often non-trivial for production engineers without expert data-analytical knowledge. To mitigate such challenges, visualization recommendation systems prove their efficacy by automating the process of generating meaningful visualizations that help users to explore and understand their data more effectively.
Over the past few decades, a noticeable amount of work has already been done in this field. Tools and methods have already been introduced to visualize the data and automate the process of visualizations [3][4][5]. However, these techniques are mainly focused on particular visualization forms.
Most recommendation systems are designed for the sole purpose of selecting the most relevant features in the data and, therefore, support a limited number of visualization techniques. For example, SeeDB [3] aids users in identifying interesting visualizations using a deviation-based metric, yet it only uses bar charts to display the result and other recommendations, whereas DEEPEYE [4] only uses four common visualization techniques, namely bar charts, line charts, pie charts, and scatter charts. SeekAView [5] also has a fixed number of visualization techniques to select useful types of trends from, including frequency plots, scatter plots and parallel coordinates. The drawback of using a small number of visualization techniques is that not every type of task can be covered by the same type of visualization. The existing visualization recommendation systems proposed in the current literature face several shortcomings, ranging from the number of dimensions the recommendation system can handle to the number of visualization techniques adopted. Moreover, most of these systems require inputs from expert users with domain-specific knowledge. Visualization recommendation systems whose recommendation engine is based on supervised machine learning or neural networks [4,7] also suffer from the problem of overfitting. Training such a model relies mainly on the training data, and a lack of available training data results in an ineffective model that may be biased towards data similar to the training data.
In this paper, we first present our proposed model for monitoring production processes, developed in collaboration with the manufacturing engineering departments in the context of the Internet of Production (IoP) cluster of excellence. For the scope of this paper, we focus on the visualization recommendation engine of the proposed data exploration pipeline. We propose a novel rule-based system for the construction of such recommendation engines. We show how impartial and effective visualizations are recommended using a knowledge-based rule engine designed and tested within the IoP cluster. The recommendation system uses key factors such as data characteristics, intended task, user feedback, and the knowledge base to decide the best suitable visualization technique. Furthermore, the generated recommendations are ranked qualitatively based on several statistical properties of the data.
This paper is an extension of our work originally presented in [8]. In our previous work, we presented a formalized approach for creating a rule-based model for generating visualization recommendations. In this paper, we apply the algorithmic contributions to a working model. To highlight the practicability of our approach in the context of high-dimensional data exploration for engineering processes, we have deployed and tested our tool within the production systems environment. We briefly overview our work in the context of the IoP cluster and describe the proposed data exploration pipeline. We present the recommendation tool we constructed as a proof of concept for our proposed methods. Finally, we highlight the efficacy of our proposed approach for data exploration tasks and present the results of our user studies conducted within the IoP research cluster.
Our contributions in this paper are summarized below:

1. Classification of data into characteristics and proposal of a formal visual taxonomy: The data is categorized based on several factors such as the type of data (discrete, continuous) or its format (e.g. numerical, categorical). These factors influence the type of visualization technique to be used and are therefore essential for the development of the rule engine. A formal visual taxonomy is proposed as well, which provides the theoretical foundation for the construction of the knowledge base.
2. Mapping user tasks to visual structures and creation of the task-based visual taxonomy: The intended user task, required to generate visualization recommendations, is abstracted in the form of a task-based visual taxonomy, which in turn is mapped onto the type of visualization techniques to generate the recommendations.
3. Creation of rules for the knowledge-based rule engine: After the categorization of data based on its characteristics, knowledge-based rules are generated to provide recommendations. The input factors for the rules, apart from the data categories, include aspects such as the intended task of the user and the number of dimensions to be visualized. Based on these input factors, the rule engine decides the best suitable visualization techniques in the form of recommendations.
4. Ranking of visualization recommendations: The recommendation system generates a set of visualizations, as multiple rules can be applicable. Therefore, a ranking algorithm based on task-dependent statistical measures sorts the visualizations in descending order of their scores, ensuring that the most useful visualizations are displayed first to the user, resulting in an efficient process.
5. Evaluation of the designed system: The evaluation was conducted in a two-fold approach. First, we tested a sample scenario to evaluate the usefulness of the generated and ranked recommendations and compared the results with a popular visualization tool. Next, we conducted a user study with production engineers without extensive data-analytical expertise.

Related Work
As discussed in the previous section, there has been significant progress in the research and development of visualization recommendation systems in recent years. However, the design principles upon which these systems are built differ from those of traditional product recommendation systems. Classical recommendation systems are classified mainly into two groups [9]: (i) content-based filtering and (ii) collaborative filtering. Both approaches rely on historical data about previously recommended items. In the case of VizRec systems, visualizations are recommended based on dynamic datasets and a varied set of tasks. Even the analysis of the same dataset may give rise to a completely uncorrelated set of tasks depending on the exploratory needs of the user. Hence the historical data consisting of recommendation ratings is often sparse, resulting in the "cold-start problem" [2], and these traditional methods are not suitable for designing VizRec systems. Studying the current literature, we concluded that most of the proposed systems are based on the "knowledge-based filtering approach", where rules are created based on several properties such as (i) data characteristics, (ii) intended task, (iii) domain knowledge, and (iv) user preferences based on the exploratory objectives. According to [10], such systems can be classified based on one of the four following strategies:

• Data Characteristics Oriented: Recommendation systems based on this strategy focus primarily on the characteristics and type of data to generate visualizations. The data attributes are used to create visual marks for the final visualization. A key feature of this approach is the formalization of visual mappings from data characteristics to visual marks. The work done in this field includes VizQL [11] (used in the Show Me module of Tableau) and Vega-Lite [12]. Both provide formal declarative specifications to convert the data characteristics into visual mappings. These include mappings such as selecting the x- and y-axis dimensions, the data type, the mark type to be used and the summaries (such as the mean) to be displayed. These formal mappings are then used to create rules that can be applied to generate useful visualizations based on the dataset provided.
• Task Oriented: This strategy uses the concept of an intended task to visualize the data. These intended tasks may include identifying data relationships such as correlation, comparison, and distribution, and the type of visualization technique to be used by the recommendation system relies heavily on this. Most of the current studies in this area create the user task list manually [10] and then apply the data characteristics approach based on the chosen intended task.
• Domain Knowledge Oriented: The research in this area focuses on improving recommendations based on domain knowledge. This is done by expressing the task and data in the vocabulary of the problem domain to satisfy the user requirements in that specific domain [10], either through knowledge sharing or by gaining domain knowledge from existing knowledge bases. RAVE [13] uses NASA's domain knowledge along with user-selected tasks or visualization types to generate a meaningful visualization. RAVE can generate visualizations such as a 2D scatterplot, bar graph and pie chart using its knowledge base. Semantic-based recommendations in the form of ontologies are also a key research area for domain knowledge-based recommendation systems.
• User Preferences Oriented: This approach explicitly requires user input to decide the user preference and generate recommendations accordingly. This can later be used to improve the system. Visualization recommendation systems using this strategy constantly analyze and record the actions performed by the user so that the recommendations preferred by the user can be displayed rather than irrelevant ones [14]. Recent work in this area employs machine learning approaches to steadily improve the model and prune out irrelevant recommendations [15].
The following subsections describe recent studies and work in the field of visualization recommendation systems based on one or more of the recommendation strategies discussed above.

SEEDB
Based on these recommendation strategies, several popular systems have been developed in recent times. For example, SEEDB [3,16] is a recently developed visualization recommendation system. Using a subset of the data extracted from a query, SEEDB is able to generate visualization recommendations that it regards as useful based on multiple metrics. However, SEEDB does not support multiple types of visualization techniques based on the data characteristics or intended tasks and instead generates recommendations only in the form of bar charts. The problem with this approach is that bar charts can only visualize a certain aspect of the data characteristics and tasks, such as a comparison in magnitude. Another drawback is the dependency on the user for query generation and the selection of dimensions, for which the user must have some domain knowledge regarding the dataset or expertise in visualization.

SeekAView
SeekAView [5] is a visual analytics system that allows users to create subspaces from a high-dimensional dataset. It also provides suggestions for users to reconfigure their views and identify interesting insights from the generated suggestions. Manual effort is required from the user to detect dimensions that show an interesting pattern. Despite being a very useful tool for high-dimensional data visualization and identifying important data trends, SeekAView has deficiencies, as it recommends only a specific set of visualizations, such as density plots for the dimensions, while a PCA scatterplot, parallel coordinates and a scatterplot matrix are only used for the resulting subset. The fact that SeekAView displays density plots for every dimension makes it complex to use for analyzing high-dimensional datasets. Therefore, domain knowledge and a high amount of manual effort are required to analyze the density plots for important dimensions, and some key relationships between dimensions may never be identified due to the low number of visualization techniques used.

DeepEye
DeepEye [4] is a recently developed visualization recommendation system that employs machine learning to generate recommendations. The motivation for this approach is to solve three individual problems: first, to decide whether an individual visualization is useful or not; second, to compare two visualizations and select the better one; and last, in the case of multiple visualizations, to find the top-k ones in a dataset. Though DeepEye proposes an effective approach toward visualization recommendation, as it uses a hybrid system based on machine learning and expert rules, it is only able to suggest recommendations in the form of pie charts, bar charts, line graphs and scatterplots, which are not sufficient to cover all types of useful visualization recommendations. Using machine learning models to decide the usefulness of a visualization depends considerably on the training data used. Therefore, datasets from diverse fields are required to train an unbiased model, and expert rules for only four visualization techniques are not enough to cover all possible types of intended tasks.

VizML

VizML [7] is another machine learning approach for visualization recommendation systems. VizML generates visualization recommendations using neural network and baseline models trained and tested on one million visualizations taken from the Plotly Community Feed [17]. Data collected from Plotly is cleaned by removing duplicate visualizations. The corpus is then used for feature and design-choice extraction, which is subsequently used to train the models to predict recommendations. Although VizML is an efficient machine learning approach to visualization recommendation, it has a few limitations. VizML utilizes a training dataset obtained from Plotly [17] only, and while the dataset is large, the resulting model is still biased towards Plotly. Therefore, datasets from diverse data sources and fields should be obtained and used for training the models. Another disadvantage of using Plotly datasets is that Plotly is used by both expert and non-expert users to create visualizations; visualizations created by non-expert users are prone to error and may not be very useful. In the survey paper by Zhu et al. [18], the authors present a summarized chart depicting the overall VizRec ecosystem. This summarization helped us identify the current research gap. From the overview, we can see that all existing systems are dependent on the data (data source and data type) as well as the design space (number of supported visualization types). To bridge this gap, we propose a recommendation model which is independent of such constraints. Our proposed formalism for constructing a VizRec system provides a generic approach for incorporating new types of data and visualizations supporting a varied range of exploratory tasks.

Exploratory Pipeline for Monitoring Production Processes
The popularity of Industry 4.0 has motivated production domains to move toward the concept of Smart Factories. Such environments require tools and methods to be integrated through data management strategies. In today's Internet-of-Things (IoT) driven world, most production factories are adopting cross-domain data exchange strategies to derive value and improve productivity. Sharing domain knowledge, models and data across all relevant domains provides the ability to increase productivity and the agility to report, predict and provide recommendations in real time. To handle such requirements, a real-time dynamic version of the static historical factory data, enriched with information from physical processes, has been proposed, termed the Digital Shadow [19]. To implement a conceptual reference infrastructure which, in turn, would enable the creation and application of Digital Shadows, the IoP Cluster of Excellence was established, comprising computer scientists and mechanical engineers in an interdisciplinary ecosystem. Some applications of digital shadows for production systems can be found in the current literature [20,21]. However, no work to date has captured and modelled high-dimensional production data and processed it through a complete exploratory analytical pipeline that extracts the data and intended-task characteristics and visually encodes them to generate meaningful recommendations. Such a system would help analysts, with or without domain expertise, achieve better insights, reasoning capabilities and decision-making abilities for engineering problems, enhancing process monitoring and in turn leading to improved productivity.

Data Exploration System
The requirement in the IoP cluster of excellence was to provide a platform for performing exploratory analysis on high-dimensional data. Generating data visualizations can often be considered a trivial task; however, there is a need for further development of visualization methods for handling large-scale, high-dimensional data. This is mainly because, in high-dimensional settings, the commonly used visualization techniques suffer from visual clutter, and generating visual structures proves to be computationally costly [22]. A significant amount of work in the current literature addresses dimensionality reduction for better visualization models. However, projecting high-dimensional data into lower-dimensional spaces fails to preserve the intrinsic structure of the data and hence suffers from loss of information and interpretability. To address these challenges, our proposed system provides an automated approach to converting high-dimensional data from production systems into knowledge through intuitive visualizations.
In Fig. 1, we present an overview of the system architecture of our proposed exploratory pipeline. The architecture comprises three separate layers: (i) Data Management Layer, (ii) Knowledge Management Layer, and (iii) Data Exploration Layer.
Data Management Layer. Since the data generated from production systems is mostly high dimensional, data reduction strategies are needed to improve the performance of the analytical models. The raw data generated from Internet-of-Things (IoT) applications is ingested into the system along with the user inputs for the intended exploratory tasks. The main functionality of this layer is to determine an optimal solution for minimizing information loss while performing dimensionality reduction. To do so, as a first step, the ingested data undergoes preprocessing and cleaning before being used as input to our system. After this, we apply our proposed unsupervised feature selection method, which extracts relevant data attributes in an automated manner. Finally, the selected features are mapped into visual structures.
These are transferred to the next layer (Knowledge Management Layer) of our system through the Visual Structure Interface.
Knowledge Management Layer. After receiving the visual structures from the previous layer, the first step in this layer is to construct visualization models based on the visual mappings. Visual mappings map the relevant data features into visual structures (visualization axes, markers, graphical properties). In the next step, the generated visualizations are evaluated based on predefined visualization quality metrics, and the visualizations are ranked based on the evaluation results. The ranking uses a threshold obtained from the evaluation step, which helps our system reduce the search space by filtering out visualizations with low information content. By extracting the data characteristics and mapping them into appropriate metrics in the image space, we generate rules which are stored in a knowledge-based system. These rules are used to train our model for improved accuracy in generating visualization rankings. Finally, the visualization rankings are fed as input to the data exploration layer.
Data Exploration Layer. The previous layers of our system deal with data reduction and visualization filtering strategies. However, the selection of the most appropriate images based on the intended tasks remains a challenge. Hence, in this layer, we integrate our proposed visualization recommendation engine. It takes the extracted data characteristics and the user-intended tasks as inputs, along with the visualization rankings from the previous layer. The workflow then generates rules which are used for generating recommendations. These rules are stored in the knowledge base to automate the process and improve the scalability of our system when handling high-dimensional data with big data properties.
In this paper, we elaborate on the rule-based recommendation strategy adopted in the Data Exploration Layer of our proposed model. We incorporated the proposed model into a working prototype, which serves as an automated exploratory tool for monitoring production engineering processes and recommending effective visualizations.

Rule-Based Visualization Recommendation System
In this section, we present our proposed model of the recommendation system. Our system is built on top of a rule engine constructed by modelling data characteristics and intended tasks into generic rules. The data characteristics axis is used to identify key characteristics of the data, such as dimensionality, data type and data format, while the intended task axis takes into account the goal of the user based on the type of data relationship. As a first step, data characteristics are extracted automatically from the input dataset using the data abstraction module, while the intended tasks are mapped using the task abstraction module. These two different sets of inputs are used for the construction of rules which are modelled as generic rules and are stored in the Knowledge Base. During the execution, dynamic data and task encodings are generated, which are looked up in the Knowledge Base to retrieve the corresponding rule sets. The Rule Engine then ingests these rules to trigger a specific event which consists of a set of visualizations. These visualizations are then ranked and displayed to the end-user as recommendations. The complete workflow of the developed system is given in Fig. 2.

Workflow
As a first step in building our model, we construct a forward-chained Knowledge Base for storing recommendation rules. We use a knowledge base to handle the cold-start problem and to ensure that the system remains unbiased and generic, regardless of the domain it is used in. The Knowledge Base stores rules based on the data characteristics (extracted in the Data Abstraction module) and the intended tasks of the user, which are mapped from the user input in the Task Abstraction module (see Fig. 2).
Data characteristics include aspects such as the number of dimensions required for visualization, the type of data in the dimension to be visualized (discrete/continuous/temporal) as well as the data format (numerical/categorical). The intended task comprises the type of relationship the user wants to explore within the dataset through the generated visualizations.

Data Abstraction Module
To generate visualization rules based on data characteristics, we first create a chart vocabulary consisting of commonly used visualization charts. Prior to the selection of the visualization types, we conducted a survey among visualization experts, production engineers and data analysts to identify the commonly used visualization charts. We wanted to include charts for visualizing one-dimensional to n-dimensional data (where n represents high data dimensionality). Hence, for the proof of concept, we selected the 22 popular visualization charts listed in Table 1.
The functionality of the data abstraction module is to automate the process of extracting data characteristics. These characteristics consist of the data type (e.g. discrete, continuous, temporal), the data format (e.g. numerical, categorical, relational, part-to-whole) and the data dimensionality. Later, in "Visualization Recommendation Tool", we describe the process along with the implementation details of the module for our proposed tool.
The Chart Vocabulary aims to map multiple types of visualization techniques to the characteristics of data with varied dimensions. Next, we will explain how we use our proposed chart vocabulary to construct the rules for the Knowledge Base.

Task Abstraction Module
In Fig. 2, we see that the intended tasks are encoded into rules which are later used to establish a mapping between user tasks and recommended visualizations. The intended task is a significant decision-maker in the selection of the type of visualization technique. To ensure that no expert domain knowledge is required for the recommendation system, the intended task is abstracted from the user. The process of task abstraction is modelled as a tree structure, as shown in Fig. 3. This process consists of multiple levels of abstraction, beginning with a set of easy-to-understand questions for the user, which is the only input required from the user. The remaining levels are constructed automatically within the recommendation system. We map the user inputs to the abstracted tasks for questions such as the following:

• Do you want to find the distributed range, average or extrema of data dimensions?
• Do you want to split data into categories for filtering and analysis?

The response is then mapped to high-level tasks comprising Distribution, Part-to-Whole, Change over Time, Comparison and Relationship. These high-level tasks are further mapped to several atomic tasks. The atomic tasks provide finer granularity over the generic user-intended tasks and are used to establish relationships between the user tasks and the supported visualizations, as illustrated in the sketch below.
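To make the two abstraction levels concrete, the following minimal Python sketch shows how they could be encoded. The question texts come from the paper; the dictionary names and atomic-task lists are illustrative assumptions, not the tool's actual data structures.

```python
# Sketch of the task abstraction tree (Fig. 3). Question texts are from the
# paper; dictionary names and atomic-task lists are illustrative assumptions.

QUESTION_TO_TASK = {
    "Do you want to find the distributed range, average or extrema of data "
    "dimensions?": "Distribution",
    "Do you want to split data into categories for filtering and analysis?":
        "Part-to-Whole",
    # ... further questions map to Change over Time, Comparison, Relationship
}

# Each high-level task is refined into atomic tasks; the rules in the
# Knowledge Base reference these finer-grained tasks (hypothetical examples).
TASK_TO_ATOMIC = {
    "Distribution": ["range", "average", "extrema"],
    "Part-to-Whole": ["proportion", "hierarchical"],
    "Relationship": ["correlation", "clusters", "outliers"],
}

def abstract_task(question: str) -> tuple:
    """Map the question chosen by the user to its task encoding."""
    high_level = QUESTION_TO_TASK[question]
    return high_level, TASK_TO_ATOMIC.get(high_level, [])
```

Next, we describe how these relationships are captured into generic rules that are stored in the Knowledge Base of the recommendation system.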

Rule Construction
In this paper, we present a formal theory for generating the recommendation rules that are stored in the knowledge base. The purpose of the formalization of rules is to help design a less error-prone and more efficient knowledge base in a dynamic manner, ensuring that further rules can easily be added to the rule engine. This approach is more flexible than a hard-coded static knowledge-based rule set. As the building block of our rule set, we introduce the concept of Attributes. We define attributes (A) as the state of the system based on which a rule is fired. Every attribute is assigned a value from a specified domain set D, where D = D_1 ∪ D_2 ∪ … ∪ D_n and D_i is the finite domain for attribute A_i (A_i ∈ A, i = 1 … n). Attributes are further divided into two types, atomic and composite. Formally, the distinction between atomic and composite attributes is given as A = A_a ∪ A_c, A_a ∩ A_c = ∅. An atomic attribute can only be assigned one domain value at a time and is represented by the function A_i : V → D_i, while a composite attribute can be assigned multiple domain values and is represented by the function A_i : V → 2^(D_i), where V_i is the current value for attribute A_i and V_i ∈ D_i (atomic attributes) or V_i ⊆ D_i (composite attributes). An unknown or unspecified state value for an attribute A_i is denoted as A_i = null.
The following example further explains attributes within our proposed visualization recommendation system. We define the set of attributes as:

A = {dimensionality, data_type, data_format, intended_task, recommended_visualizations}

The domains are formulated as:

D_dimensionality = {1D, 2D, …, nD}
D_data_type = {Discrete, Continuous, Temporal}
D_data_format = {Numerical, Categorical, Relational, Part-to-Whole}
D_intended_task = {Distribution, Part-to-Whole, Change over Time, Comparison, Relationship}

The attribute dimensionality is an example of a composite attribute and can be further divided into low dimensionality (L_D) and high dimensionality (H_D), where L_D ⊂ D_dimensionality contains the dimensionalities up to 5D, H_D ⊂ D_dimensionality contains the dimensionalities above 5D, and L_D ∩ H_D = ∅. The attributes data_type and data_format are composite attributes as well. The intended_task attribute is also a composite attribute; e.g. for the intended task of comparing part data to the total, the intended_task attribute will be {Part-to-Whole, Hierarchical}. The last attribute, recommended_visualizations, consists of the set of visualizations recommended to the user and is a composite attribute as well, having one or multiple recommended visualizations as its value.
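As a minimal illustration of this formalism, the following Python sketch models atomic and composite attributes with domain checking; the class and helper names are our own, not part of the tool.

```python
from dataclasses import dataclass

# Sketch of the attribute formalism: an atomic attribute holds exactly one
# domain value, a composite attribute a subset of its domain; None encodes
# the unknown state (A_i = null). Names here are illustrative assumptions.

@dataclass
class Attribute:
    name: str
    domain: frozenset        # D_i, the finite domain of the attribute
    composite: bool = False  # atomic: single value; composite: value set
    value: object = None     # None represents the null (unknown) state

    def assign(self, v):
        values = set(v) if self.composite else {v}
        if not values <= self.domain:
            raise ValueError(f"{values - self.domain} not in domain of {self.name}")
        self.value = values if self.composite else v

dimensionality = Attribute(
    "dimensionality", frozenset({"1D", "2D", "3D", "4D", "5D", "nD"}),
    composite=True)
dimensionality.assign({"2D", "3D"})  # a low-dimensionality (L_D) subset
```

Using the above formalization of the rule engine, we define the generic rules for the recommendation system that are constructed and stored in the Knowledge Base.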
Below we show a sample of our constructed rules; rule r_3, for instance, is defined as:

r_3: (dimensionality > 3D) ∧ (data_type = {discrete}) ∧ (data_format = {numerical}) → recommended_visualizations = {scatterplot}

If the data abstraction attributes (dimensionality > 3) ∧ (data_type = {discrete}) ∧ (data_format = {numerical}) are provided as the current state of the system, rule r_3 is fired and the generated recommendation set is {scatterplot}.
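The following sketch shows how rule r_3 could be expressed executably under this formalism; the dictionary encoding of the system state and the function names are assumptions for illustration.

```python
# Rule r_3 from above, expressed as a condition/action pair over the system
# state. The dictionary encoding of the state is an assumed representation.

def r3_condition(state: dict) -> bool:
    return (state["dimensionality"] > 3
            and state["data_type"] == {"discrete"}
            and state["data_format"] == {"numerical"})

def r3_action() -> set:
    return {"scatterplot"}  # recommended_visualizations when r_3 fires

state = {"dimensionality": 4,
         "data_type": {"discrete"},
         "data_format": {"numerical"}}
recommendations = r3_action() if r3_condition(state) else set()
print(recommendations)  # {'scatterplot'}
```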

Knowledge Base Construction
The above formalization for rule generation is extended to the construction of the Knowledge Base. We propose a formal design principle with which the Knowledge Base can easily be extended to incorporate rules on the fly.
First, we define a rule r = (COND, DEC, ACT), where COND is the conjunction of the set of conditions that must be fulfilled for the rule to be fired. It is represented as:

COND = cond_1 ∧ cond_2 ∧ … ∧ cond_n, where cond_i ∈ COND, i = 1 … n.

DEC is the decision part of the rule, which assigns values to attributes when the rule is fired, and is represented as:

DEC = {dec_1, dec_2, …, dec_m}, where dec_i ∈ DEC, i = 1 … m and DEC is the set of all decisions.

ACT is the transition in the system performed by certain actions based on the rule. Hence, we define an atomic rule r as:

r: LHS(r) → RHS(r), DO(ACT)

where LHS(r) is the conditional part of the rule, RHS(r) is the decision part of the rule, and DO(ACT) is the independent set of actions performed by the rule.
An ordered set of knowledge-based rules having the same rule schema can be grouped in the form of a table. A table is formally defined as:

t = ⟨r_1, r_2, …, r_k⟩, where all rules r_i share the same rule schema.

In the case of multiple tables, such as the decision tables for task abstraction to extract the intended task for the recommendation system, the concept of inter-table connection links is applied.
A connection link is an ordered pair c = (r, t), c ∈ R × T, where R is the set of rules in the knowledge base and T is the set of tables. Each individual rule has its own connection link; based on the rule fired, the respective connection link transfers control from the table containing the rule to the next connected table. All the tables and connection links are grouped together to form the knowledge base of the recommendation system. The complete knowledge base can formally be represented as K = (T, C), where T is the set of all tables in the knowledge base and C is the set of all connection links that connect rules within T.

In the context of the visualization recommendation system, the set F represents the set of features of the dataset imported by the user. The set D_T represents all possible supported data types for a feature f_i, where D_T = {Discrete, Continuous, Temporal} and f_i ∈ F, i = 1 … n. The set D_F represents all possible data formats for a feature f_i and is defined as D_F = {Numerical, Categorical, Relational, Part-to-Whole}, whereas the set I_T contains the list of intended tasks supported by the recommendation system and is defined as I_T = {Distribution, Part-to-Whole, Change over Time, Comparison, Relationship}. The intended task is extracted in the form of a decision tree (see Fig. 3) with the help of user preferences to ensure that meaningful and relevant visualizations are recommended. The visualization techniques supported by the recommendation system are represented by the set V_T = {V_1, V_2, …, V_v}, where v is the total number of visualization techniques the recommendation system can generate. Details regarding these visualization techniques and their data characteristics are provided in the Chart Vocabulary.
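A compact Python sketch of this table-and-link structure is shown below; the rule representation, the control-flow helper and the example rules are our own illustrative choices under the formalism above.

```python
from typing import Callable, NamedTuple

# Sketch of K = (T, C): tables group rules sharing a schema; a connection
# link hands control to the next table after a rule fires (names assumed).

class Rule(NamedTuple):
    cond: Callable[[dict], bool]  # LHS(r): conjunction of conditions
    dec: dict                     # RHS(r): attribute assignments
    next_table: str               # connection link target (None = stop)

def step(tables: dict, table: str, state: dict):
    """Fire the first matching rule in `table` and follow its link."""
    for rule in tables[table]:
        if rule.cond(state):
            state.update(rule.dec)  # apply the decision part DEC
            return rule.next_table  # transfer control via the link
    return None

tables = {
    "task_abstraction": [
        Rule(lambda s: s["answer"] == "split into categories",
             {"intended_task": {"Part-to-Whole"}}, "main_table")],
    "main_table": [
        Rule(lambda s: s.get("intended_task") == {"Part-to-Whole"},
             {"recommended_visualizations": {"Pie Chart"}}, None)],
}
state = {"answer": "split into categories"}
nxt = step(tables, "task_abstraction", state)
step(tables, nxt, state)  # state now holds the recommendation set
```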
The structure of the knowledge base, based on the constructed rules, is shown in Fig. 4. The knowledge base consists of a decision tree module (Fig. 3) for the intended task as well as a main table consisting of knowledge-based rules. The connection links c1, c2, c3 and c4 are used to transfer control between the tables. The attributes in the main table follow the rule schema defined above. Based on this schema, we construct the knowledge base. The rules stored within the knowledge base are constructed according to the formalism discussed above. When new data arrives, the data type and the data format, along with the intended task, are extracted. If the encodings for these respective variables are not present in the knowledge base, new rules are constructed based on the above formalisms. These newly constructed rules are then stored within the knowledge base according to the above schema.

Generate Recommendations
Algorithm 1 presents our approach for generating visualization recommendations based on the knowledge-based rules. The Knowledge Base constructed according to the model described above is used by the rule engine to generate a list of recommended visualizations. The input parameters for the algorithm are the list of generated rules, the data characteristics extracted automatically from the input dataset, and the intended task mapped using the task abstraction module. The algorithm then matches the provided parameters with the stored rules in a dynamic manner and adds a visualization to the recommendation list only if all parameters match the fields of the relevant visualization rule. For example, a Histogram is added to the list if the dataset contains at least 1 dimension with Data Type: Continuous, Data Format: Numerical and High-Level Task: Distribution. The list of recommendations is then sent to the Ranking module, which sorts the visualizations based on the ranking scores assigned to each of them before they are displayed to the end-user. Based on the user's intended task, the relevant ranking metric (normal distribution, gradient or correlation) is selected along with its respective threshold value. A score is calculated for the dimensions within the dataset, and dimensions outside the threshold score are automatically pruned out. The remaining dimensions are then sorted by score to ensure that the highest-scoring dimensions are displayed first to the user. This score makes it easy to filter out meaningless visualizations and sort the remaining ones so that the user finds the most useful recommendations first.
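Since Algorithm 1 itself is not reproduced here, the following Python sketch approximates its matching step under an assumed, simplified rule schema; the field names mirror the Appendix A tables but are our own stand-ins.

```python
# Approximate sketch of Algorithm 1: a visualization is recommended only if
# every extracted parameter matches the corresponding rule field. The rule
# schema below is a simplified stand-in for the Appendix A tables.

def generate_recommendations(rules, data_chars, task):
    recommendations = []
    for rule in rules:
        if (data_chars["dimensionality"] in rule["dimensionality"]
                and data_chars["data_type"] in rule["data_type"]
                and data_chars["data_format"] in rule["data_format"]
                and task == rule["task"]):
            recommendations.append(rule["visualization"])
    return recommendations  # passed on to the ranking module

# Example: the Histogram rule fires for >= 1D, Continuous, Numerical data
# with the high-level task Distribution (cf. Rule 1 in Appendix A).
rules = [{"dimensionality": {"1D", "2D", "3D"},
          "data_type": {"Continuous"}, "data_format": {"Numerical"},
          "task": "Distribution", "visualization": "Histogram"}]
print(generate_recommendations(
    rules,
    {"dimensionality": "1D", "data_type": "Continuous",
     "data_format": "Numerical"},
    "Distribution"))  # ['Histogram']
```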

Visualization Recommendation Tool
In this section, we discuss the architecture and implementation of the proposed visualization recommendation tool based on the theoretical solution approach discussed above. First, the architecture and system overview is discussed and later on, the implementation of each module is explained in detail.

System Architecture
The architecture of the recommendation system is summarized in Fig. 5. The recommendation system is divided into two main components: the front-end UI and the backend API server. These components are further modularized into several smaller components in a loosely coupled manner to ensure they can easily be managed with respect to future changes. The web-based front-end UI has been developed in React [23] and is responsible for importing the dataset in .csv format, displaying a summarized data overview and converting generated recommendations into visualizations. The library used in React for generating visualizations is D3.js [24]. The styling of the front-end is done using Material-UI [25], a React framework library. Communication between the front-end and the backend is done through REST APIs using the Flask [26] library for Python. The backend consists of several smaller modules, including the data abstraction layer (responsible for extracting data characteristics from the dataset), the rule engine (responsible for generating recommendations) and the visualization data module (responsible for generating the data to be visualized on the front-end as well as ranking the visualizations).
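As an illustration of the REST boundary between the React front-end and the Flask backend, consider the minimal endpoint below; the route name, payload fields and the stubbed rule-engine call are assumptions, not the tool's actual API.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_rule_engine(data_characteristics, intended_task):
    """Stub standing in for the rule engine module described below."""
    if intended_task == "Distribution":
        return ["Histogram", "Boxplot"]
    return []

@app.route("/recommendations", methods=["POST"])
def recommendations():
    payload = request.get_json()                # sent by the React UI
    vis = run_rule_engine(payload["data_characteristics"],
                          payload["intended_task"])
    return jsonify(recommendations=vis)         # rendered with D3.js

if __name__ == "__main__":
    app.run(port=5000)
```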

Implementation Details
This section discusses each developed module and its functionality in detail. Starting with the backend modules, first, the rule engine and knowledge-base modules are discussed in terms of implementation. Next, the data abstraction module is described, and finally, the visualization data module is explained. Subsequently, the web-based UI and its implementation, along with the front-end visualization library module, are discussed in detail at the end.

Rule Engine
This module is the core of the recommendation system. The knowledge base and rule engine were designed based on our proposed rule engine construction theory in order to create a dynamic knowledge base for the rule engine, enabling the seamless integration of additional rules and visualization types.
The architecture of the knowledge base is shown in Fig. 6. The knowledge base uses domains stored within the system for each field to ensure the correctness of the defined rules; any field value not included in the relevant domain is ignored. The domain values incorporated into our tool cover the dimensionality (1D up to nD), the data types (Discrete, Continuous, Temporal), the data formats (Numerical, Categorical, Relational, Part-to-Whole) and the high-level tasks (Distribution, Part-to-Whole, Change over Time, Comparison, Relationship). In Appendix A, we present an overview of the generated rules stored in the knowledge base. Our rule engine supports the custom operators *, / and −. For a field with value *, the algorithm loads all values in the field's domain for the knowledge-base rule; e.g. Rule 7 in Table 2 (Appendix A) has * as its data type for Heatmap, which means the algorithm loads Discrete, Continuous and Temporal into the data type field of the knowledge-base rule for Heatmap. If the field contains the / operator, all values separated by / are added to the knowledge-base rule for the relevant field; e.g. Rule 11 has the data type Continuous/Temporal for Multiline Graph, so the algorithm splits the value by / and adds both Continuous and Temporal to the data type field of the knowledge-base rule. Finally, the − operator is used as a range operator; e.g. for Rule 8, the dimensions for Scatterplot are 2D-4D, so the algorithm adds all values from the domain starting at 2D and ending at 4D (2D, 3D, 4D) to the dimensionality field of the Scatterplot knowledge-base rule. The data characteristics and intended task are sent from the UI to the server via the REST API. The system matches the provided parameters against all knowledge-base rules in a dynamic manner and adds a visualization to the recommendation list only if all parameters match the fields of the relevant visualization rule; e.g. Histogram is added to the list if the dataset contains at least 1 dimension with Data Type: Continuous, Data Format: Numerical and High-Level Task: Distribution (Rule 1 in Appendix A). The list of recommendations is then sent to the Visualization Data Module to prepare the data, which is used to generate the corresponding visualizations on the UI.
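The following minimal sketch reproduces the field expansion of the *, / and − operators described above; the domain contents are taken from the text, while the helper name is our own.

```python
# Expansion of the custom rule operators "*", "/" and "-" into explicit
# domain values, as described above. DOMAINS mirrors the stored domains;
# the helper name `expand` is an assumption.

DOMAINS = {
    "data_type": ["Discrete", "Continuous", "Temporal"],
    "dimensionality": ["1D", "2D", "3D", "4D", "5D"],
}

def expand(field: str, value: str) -> list:
    domain = DOMAINS[field]
    if value == "*":             # wildcard: load the whole domain
        return list(domain)
    if "/" in value:             # alternatives: split on "/"
        return value.split("/")
    if "-" in value:             # range: all domain values between the ends
        lo, hi = value.split("-")
        return domain[domain.index(lo):domain.index(hi) + 1]
    return [value]

assert expand("data_type", "*") == ["Discrete", "Continuous", "Temporal"]
assert expand("data_type", "Continuous/Temporal") == ["Continuous", "Temporal"]
assert expand("dimensionality", "2D-4D") == ["2D", "3D", "4D"]
```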

Data Abstraction
The purpose of the Data Abstraction Module is to automate the process of detecting different types of data characteristics, such as numerical discrete/continuous, categorical and temporal dimensions, within the provided dataset without the need for user input. This removes the requirement of expert domain knowledge regarding the dataset and allows the generation of automatic recommendations with minimal user input. This module is implemented in Python 3 and is used by the server to preprocess the provided dataset, extract useful characteristics from it and generate a summarized overview of the dimensions within the dataset.
The input parameters for this module are the input CSV file and, optionally, selected dimensions in case the user wants to refine the dataset and remove unwanted dimensions (all dimensions are selected by default). The output values are lists of categorized dimensions, including categorical, discrete, continuous and temporal dimensions. The dataset is first cleaned and preprocessed, e.g. by dropping empty or serial-number columns and filling sparse columns with zeroes for numerical values or mode values for categorical values.
The cleaned data frame is then iterated over column by column. To detect categorical columns, the ratio of unique values to total values is calculated and compared with a threshold value. The initial threshold value is set to 5%, as, based on our testing and evaluation, categorical dimensions within the datasets did not exceed this threshold; it can be further optimized if required. The column name and its existing values are appended to the categorical dimensions list if the ratio is below the threshold. To detect temporal dimensions, the column values are compared to a predefined list of DateTime format regular expressions. If the column matches one of the formats, it is appended to the temporal dimensions list together with the matched DateTime format. Continuous numerical dimensions are detected if a column has a non-integer numerical data type or contains ordinal numeric values in a range, while discrete numerical dimensions are detected if a column contains unordered integer values. The categorized lists of dimensions are automatically displayed on the UI when the user imports the dataset so that the user can remove any unwanted dimensions. The dimension lists are also sent to the rule engine and combined with the intended task of the user to fire the relevant knowledge-base rules and generate useful recommendations.
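The following pandas-based sketch mirrors this categorization pass. The 5% unique-value threshold follows the text; the regular-expression list, the detection order and the input file name are simplified assumptions.

```python
import pandas as pd

# Sketch of the dimension categorization described above. The 5% threshold
# follows the text; the DateTime patterns are simplified example formats.

CATEGORICAL_THRESHOLD = 0.05
DATETIME_PATTERNS = [r"^\d{4}-\d{2}-\d{2}$", r"^\d{2}/\d{2}/\d{4}$"]

def categorize_dimensions(df: pd.DataFrame) -> dict:
    dims = {"categorical": [], "temporal": [], "continuous": [], "discrete": []}
    for col in df.columns:
        series = df[col].dropna()
        ratio = series.nunique() / max(len(series), 1)
        if ratio < CATEGORICAL_THRESHOLD:            # few unique values
            dims["categorical"].append(col)
        elif any(series.astype(str).str.match(p).all()
                 for p in DATETIME_PATTERNS):        # matches a date format
            dims["temporal"].append(col)
        elif pd.api.types.is_float_dtype(series):    # non-integer numeric
            dims["continuous"].append(col)
        else:                                        # unordered integers etc.
            dims["discrete"].append(col)
    return dims

df = pd.read_csv("production_data.csv")  # hypothetical input file
print(categorize_dimensions(df))
```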

Data Visualization
The Visualization Data Module is responsible for generating data in a format that can be processed on the front-end to generate visualizations, as well as for ranking the recommendations. Visualizations are ranked using our proposed ranking system as described above. Based on the user's intended task, the relevant ranking metric is selected along with its respective threshold value. A score is calculated for the dimensions within the dataset, and dimensions outside the threshold are automatically pruned. The remaining dimensions are then sorted by score to ensure that the highest-scoring dimensions are displayed first to the user.
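A condensed sketch of this task-dependent scoring is shown below. The metric choices correspond to the three metrics named earlier (normal distribution, gradient, correlation), but the concrete statistical tests, the threshold value and the function names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Sketch of the task-dependent ranking: a metric is chosen per intended
# task, low-scoring dimensions are pruned, the rest sorted descending.
# The threshold and the exact metrics are illustrative assumptions.

def rank_dimensions(data: dict, task: str, threshold: float = 0.3) -> list:
    scores = {}
    for name, values in data.items():
        x = np.asarray(values, dtype=float)
        if task == "Distribution":          # normality of the dimension
            scores[name] = stats.shapiro(x).pvalue
        elif task == "Change over Time":    # mean gradient magnitude
            scores[name] = float(np.abs(np.gradient(x)).mean())
        elif task == "Relationship":        # strongest pairwise correlation
            scores[name] = max(
                abs(stats.pearsonr(x, np.asarray(v, dtype=float))[0])
                for n, v in data.items() if n != name)
        else:
            scores[name] = 1.0              # no pruning for other tasks
    kept = {n: s for n, s in scores.items() if s >= threshold}
    return sorted(kept, key=kept.get, reverse=True)  # best dimensions first
```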

User Interface and Visualization Library
The user interface (UI) for the implemented tool is developed entirely in the React.js framework. To generate multiple types of visualizations, a custom visualization library module has been developed using the D3.js JavaScript library. The visualization library currently supports 16 chart types (all given in the Chart Vocabulary in Table 1). The Visualization Library Module uses the processed data received from the Visualization Data Module as input and converts it into the corresponding visualizations on the front end. Figure 7 shows the complete web-based UI, with its three distinct sections labelled 1, 2 and 3. (1) The system currently supports .csv data files, which are imported via the web UI and sent to the server to generate a summarized overview. The intended task is selected from the set of abstracted questions using a drop-down menu and mapped accordingly to lower-level tasks for the rule engine. (2) A custom filter has also been implemented in the UI to give the user more control to remove dimensions deemed less useful; the summarized overview is updated accordingly.
(3) The ranked recommendations in the form of Scatterplots, Bubbleplots and Parallel Coordinates are generated and displayed on the UI, with color used for the category Species.

Evaluation
To show the usefulness of our approach, we analyze a use case with the Singapore Airbnb dataset. We show the recommended visualizations for some of the intended tasks. The generated ranked recommendations were then compared with the recommendations generated using the "Show Me" feature of Tableau.
Experimental setting. We conducted our experiments on a server running Ubuntu 14.04 with two Intel Xeon X5647 @ 2.93 GHz CPUs (8 logical cores per CPU) and 16 GB RAM. Because of limited space, we describe the evaluation results of only one use case; an extensive evaluation report with two more usage scenarios can be found at https://figshare.com/s/ef97eb5e7e26374e45a1.

Singapore Airbnb Data
The Singapore Airbnb dataset contains multiple factors that influence the price of the available rooms. It consists of 15 influencing dimensions, one target dimension (price) and 7922 rows. The data abstraction layer automatically discards dimensions such as the serial number (id) column and categorizes the remaining dimensions based on their data characteristics.

Task: Comparison
In Fig. 8, we present a snapshot of our Recommendation Dashboard. The Heatmap Matrix and the Matrix of Bar Graphs were the highest-ranked visualizations recommended by our system. As discussed before, the user input for the system is two-fold: (i) importing the dataset and (ii) selecting the intended task. Once completed, the "Recommendation" button triggers the backend engine to perform data preprocessing and generate mappings according to the selected tasks. These mappings are then used by the rule engine to fetch rules from the knowledge base, generate and rank recommendations and finally display them in the visualization dashboard.

Task: Distribution
The recommendation system generates a set of visualizations and presents them in the constructed dashboard, similar to those generated for the comparison task shown in Fig. 8. For lack of space, we show only the top-ranked visualizations, which highlight the distributions of important variables in the dataset. A histogram matrix and a boxplot matrix, shown in Fig. 9, are the two top-ranked visualizations. The histogram matrix contains a set of histograms for eight useful dimensions. Analyses regarding multiple factors such as price, availability over the year, minimum nights and latitude/longitude of available places can be made using this matrix. The boxplot matrix is generated using both categorical dimensions, the neighbourhood dimension and the room_type dimension. The neighbourhood category is split into five regions of the city, while there are three types of rooms (private, shared, and entire apartment) within the dataset. Our system recommends the boxplots, which can easily be analyzed to show that the Central Region of the city has the highest average and maximum prices while the North-East Region is the cheapest available option for a room.

Task: Relationship (Clusters/Correlation/Outliers)
To identify existing relationships within the dimensions of the dataset, the recommendation system generated a Matrix of Heatmaps and two Parallel-Coordinates charts for both categories, shown in Fig. 10. These visualizations can be used to detect the existence of clusters, correlations or outliers within the dataset. For example, the heatmap matrix identifies the trend of increasing listing prices moving from the North Region (less expensive) to the West Region (more expensive) and the magnitude of the region-wise price difference between rooms.

Comparison with Tableau
As a qualitative evaluation of our proposed system, we compare the visualizations generated for the same tasks with those of Tableau's "Show Me" feature. Figure 11 shows the result with reviews-per-month and year-of-last-review on the x-axis, average values of multiple dimensions on the y-axis and room-type as color. To generate other meaningful visualizations, dimensions (room type, neighbourhood group and price) were selected manually, and the system recommended a multi-bar graph for each type of room, using color for the region and the average price on the y-axis (see Fig. 11). Similar visualizations were recommended by our system, as shown in Fig. 8, along with multiple other ranked visualizations. Our proposed recommendation system performed better in the qualitative analysis of the results compared to Tableau by not only generating the visualizations recommended by Tableau but also generating multiple other ranked visualization charts based on various statistical metrics and user-intended tasks. Therefore, we conclude that the recommendation system presented in this paper proves its efficiency in generating automated visualizations, in turn providing meaningful data insights.

User Study: Production Engineering Use Case
The Production Systems Engineering dataset, generated by the manufacturing engineering department within the IoP research cluster, is used for a ground-truth evaluation of the recommendation system. The reason for such an evaluation is to test the performance of the system within the engineering domain. A summary of a sample high-dimensional production systems engineering dataset (18 dimensions, 4461 rows) used with the recommendation system is shown in Fig. 12.
The engineers were provided with a survey questionnaire to record their results and evaluate the system's qualitative performance. The following questions were included in the survey:

1. Do you have experience with Data Analytics?
2. Do you have previous experience with Data Visualization?
3. How accurate was the automatic detection and categorization of dimensions within the data set? (Rating from 1 to 5)
4. Were you able to achieve your required goal in visualizing the data set using the recommended visualizations?
5. Was the list of supported tasks sufficient to achieve the goal?
6. What was the intended task of the user selected in the system?
7. How do you rate the recommended visualizations in comparison with the provided task? (Rating from 1 to 5)
8. How relevant are the recommended visualizations for the analysis of engineering data? (Rating from 1 to 5)
9. Were the visualization types correctly used for the provided dimensions and intended task?
10. How helpful was the automated ranking of visualizations (meaningful visualizations displayed first)? (Rating from 1 to 5)
11. Would you prefer automatic visualization recommendations generated by this tool over manual effort?
The survey questions are designed in an easy-to-understand and systematic manner, not only to classify the demographics of the users within the case study but also to evaluate each module of the recommendation system. Questions 1-2 gather data regarding the type of users evaluating the system. Question 3 evaluates the performance of the Data Abstraction Module, while Questions 4-6 judge how well the Task Abstraction Layer performs. The Knowledge Base and Rule Engine Module of the recommendation system are evaluated using Questions 7-9, while the Ranking system is tested and quantified using Question 10. Finally, Question 11 sums up the overall preference for using the automated recommendation system over manually identifying insights and visualizations. The recommendation system was tested and evaluated by a team of 6 people from the WZL department and a team of 10 people from various departments related to production engineering, all belonging to the IoP research cluster. Regarding the demographics, the users' fields of work were Process Management and Production Engineering. Only two users had no previous experience with data analytics, while nine had no previous experience with data visualization.
The evaluation results are presented in Fig. 13. 15 out of 16 users were able to achieve their required goal using the generated visualizations (Fig. 13b), while 13 out of 16 users felt that the list of intended tasks currently supported by the system was sufficient to achieve that goal (Fig. 13c).

Conclusion
This paper presents a rule-based recommendation system that generates and ranks visualizations by extracting data characteristics and abstracting the user-intended task. An efficient Knowledge Base was presented, which maps the data and task abstractions into rules. As discussed in this paper, formal methods for constructing such a knowledge base provide a blueprint for modelling rule-based visualization recommendation systems. This knowledge-base-oriented rule engine is efficient because it has been implemented in a completely dynamic manner, open to future enhancements. To the best of our knowledge, no system in the current literature provides a recommendation model encompassing the entire visualization ecosystem. The automated workflow, starting from data ingestion and ending at the recommended visualizations being rendered on the screen for the end-user, is achieved by our system's capability to automatically detect data dimensions and sort them into categories. To address the "human-in-the-loop" factor for generating recommendations, we have presented a generic task abstraction model in the form of a hierarchical tree structure that provides a mapping between the high- and low-level intended tasks and the set of visualizations supported by the relevant tasks.
Future Work. The system currently extracts key data characteristics from each dimension to divide the dimensions into their respective categories. This module can be further enhanced by a scoring-based algorithm that sorts the dimensions by usefulness, which could then lead to filtering out irrelevant dimensions in an automated manner. Because the recommendation system is generic regardless of the domain, a large set of visualization techniques and user-intended tasks has already been implemented. However, the existing set of intended tasks can be extended, and visualization techniques not currently supported, such as a Tree Map, can be added to cover a wider range of user tasks and visualization techniques. We are currently working on a neural network model that will be trained with a wide variety of engineering data as well as tasks from various domains. A comparative study will be conducted between the rule-based and the neural-network-based models to study their efficacy in recommending appropriate visualizations catering to the specific exploratory goals of the users.

Appendix A Knowledge Base Rules
See Table 2.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.