Does design matter when visualizing Big Data? An empirical study to investigate the effect of visualization type and interaction use

Perkhofer, Lisa; Walchshofer, Conny; Hofer, Peter

doi:10.1007/s00187-020-00294-0

Original Paper
Open access
Published: 17 February 2020

Volume 31, pages 55–95, (2020)

Journal of Management Control


Abstract

The need for good visualization is increasing, as data volume and complexity expand. In order to work with high volumes of structured and unstructured data, visualizations, supporting the ability of humans to make perceptual inferences, are of the utmost importance. In this regard, a lot of interactive visualization techniques have been developed in recent years. However, little emphasis has been placed on the evaluation of their usability and, in particular, on design characteristics. This paper contributes to closing this research gap by measuring the effects of appropriate visualization use based on data and task characteristics. Further, we specifically test the feature of interaction, as it has been said to be an essential component of Big Data visualizations but has scarcely been isolated as an independent variable in experimental research. Data collection for the large-scale quantitative experiment was done using crowdsourcing (Amazon Mechanical Turk). The results indicate that both choosing an appropriate visualization based on task characteristics and using the feature of interaction increase usability considerably.


1 Introduction

One of the main purposes of management accounting is to provide decision-makers with relevant information for an easy, accurate, fast and rational decision-making process (Appelbaum et al. 2017; Dilla et al. 2010; Ohlert and Weißenberger 2015; Perkhofer et al. 2019b). Being able to fulfill this fundamental task is becoming more and more difficult as market dynamics increase (Eisl et al. 2012). The consequence is most likely a distortion of current working practice. Management accountants need to expand their scope from historical data reporting to real-time data processing, from using only in-house data to the inclusion of external data sources, from traditional paper based to interactive and online based reporting, and to shift the focus from reporting the past to predicting the future (Appelbaum et al. 2017; Goes 2014). To achieve this shift, new tools and technical instruments such as algorithms, appropriate management accounting systems and reporting software (Pasch 2019) and especially interactive data visualization are necessary (Perkhofer et al. 2019b; Janvrin et al. 2014; Bačić and Fadlalla 2016; Ohlert and Weißenberger 2015).

Visualizing Big Data proves to be of great importance when problems (or tasks) are high in complexity (Hirsch et al. 2015; Perkhofer 2019), or not sufficiently well-defined for computers to handle algorithmically, meaning human involvement and transparency are required (e.g. in fraud detection) (Munzner 2014; Kehrer and Hauser 2013; Dilla and Raschke 2015; Keim et al. 2008). Visualizing data means organizing information by spatial location and supporting perceptual inferences (Perkhofer et al. 2019a). Perceptual inferences are comparatively easy for humans to draw, as their visual sense is superior (with respect to fast programmable algorithms) and data transformation to the essential stores in human memory is astoundingly fast (Ware 2012; Keim 2001; Sweller 2010). Visualization thereby enhances the ability to both search and recognize, and thus significantly enhances sense-making capabilities (Munzner 2014; Bačić and Fadlalla 2016).

However, in order to optimally support the human perceptual system, an appropriate and easy-to-use visualization needs to be presented to the decision-maker (Munzner 2014; Pike et al. 2009; Vessey and Galletta 1991; Perkhofer 2019; Falschlunger et al. 2016a). Especially when data and tasks increase in complexity, as is the case with high-dimensional datasets, traditional business charts are no longer able to convey all information in one chart. Therefore newer forms of visualizations, also called Big Data visualizations, have to be taken into account (Grammel et al. 2010). Big Data visualizations are unique in their purpose and designed to deal with, and present, larger amounts and various forms of data types (Perkhofer et al. 2019b). Novel forms with the goal of presenting multidimensional data, often used in visual analytics (Liu et al. 2017; Bačić and Fadlalla 2016), range from parallel coordinates and polar coordinates plots, over sunburst-, Sankey-, and heatmap-visualizations, to scatterplot matrices and parallel coordinated views (Liu et al. 2017; Albo et al. 2016; Bertini et al. 2011; Claessen and van Wijk 2011; Lehmann et al. 2010).

By using Big Data visualizations, the management accountant is able to show the whole dataset within one comprehensive visualization and is therefore able to generate new insights that would otherwise remain undiscovered. For gaining insight, however, the visual display alone is not enough. The user needs to be able to interact with the interface (Elmqvist et al. 2011). Interacting in this context means using filter or selection techniques, drilling down to analyze the next layer of a data dimension, or interchanging data dimensions or value attributes (Perkhofer et al. 2019b; Heer and Shneiderman 2012). Only if the user is able to interactively work with the dataset and answer predefined questions or questions that arise during the process of analysis can Big Data visualizations unfold their full potential, so that new correlations, trends, or clusters can be detected for further use (Perkhofer et al. 2019a, c).

Unlike conventional charts used in everyday life (e.g. line, pie, or bar charts), new visualizations require a close focus on design and interaction in order to be considered useful (Liu et al. 2017; Kehrer and Hauser 2013; Elmqvist et al. 2011; Pike et al. 2009; Bačić and Fadlalla 2016). Unfortunately, for both the design and use of new visualization options, and the design and use of interaction, limited empirical research is available (Isenberg et al. 2013; Perkhofer et al. 2019b). Users still have to go through cost-intensive and unsatisfying trial and error routines in order to identify best practice instead of being able to rely on empirical evidence (van Wijk 2013). This led us to identify two concrete and pressing questions in current literature, addressed in this study:

(1) Appropriate use of new visualization types: Depending on data and task characteristics, some visualization types are claimed to outperform others when it comes to optimal decision support. However, these claims are mostly based on their developers' opinions or on small-scale user studies rather than on experimental research (Isenberg et al. 2013; Perkhofer 2019). As multiple options to visualize Big Data are available, we limit the scope of this study to visualizations for multidimensional data. This is because traditional forms cannot show more than three attributes or three dimensions at the same time within one visualization. This, we think, highlights the importance of and need for Big Data visualizations and demonstrates their benefits. Further, as a starting point to investigate Big Data visualizations, we choose four frequently cited and actively used visualization types (for details, see Table 1), namely the sunburst visualization, the Sankey visualization, the parallel coordinates plot and the polar coordinates plot (Bertini et al. 2011; Keim 2002; Shneiderman 1996). We wanted to investigate whether one particular visualization type can outperform the others based on the three tasks identify, compare, and summarize (classification based on Brehmer and Munzner 2013), using two different perspectives on the dataset (multiple hierarchy levels vs. multiple attribute comparisons).
Table 1 Highly cited and used visualization types identified for multidimensional data
(2) Appropriate use of interaction techniques: Pike et al. claim that interaction “has not been isolated as an experimental variable” yet, therefore hindering direct causal interpretation on this highly discussed and frequently used visual analytics feature (Pike et al. 2009, p. 272). This is because most user studies concentrate on the visualization itself, while interaction is added as an integrated feature incorporated into the source code of the visual representation. Visualizations can be used and tested without interaction (as a static form); however, interaction does not work without the visualization itself (Kehrer and Hauser 2013). “Exactly what kind of degree of benefit is realized by allowing a user to interact with visual representation is still undetermined.” (Pike et al. 2009, p. 272). Consequently, to address this claim, we isolate the effect of interaction and evaluate the difference between an almost static and a highly interactive visualization.

Performance is measured by the three components of usability defined by ISO 9241 (efficiency, effectivity and satisfaction) as well as by one comprehensive sum-score for usability described and created by the authors. For data collection, we used the crowdsourcing platform Amazon Mechanical Turk resulting in a large sample size of N = 2272. Results obtained by MTurk have been shown to be congruent with lab experiments in the context of visual analytics (Harrison et al. 2014), allowing us to believe it is an appropriate and reliable platform to test our selected visualization options. Statistical analysis was based on MANCOVA (for simultaneously assessing efficiency, effectivity and satisfaction) or respectively ANCOVA (for assessing the sum-score for usability).

Results indicate that the visualization type used and the degree of interaction have an influence on efficiency and satisfaction, while the task type primarily influences effectivity. More precisely, from a user's perspective, information retrieval and therefore a fast and accurate decision is encouraged when being confronted with Cartesian-based rather than Polar-based visualization types (the Sankey visualization or the parallel coordinates plot rather than the polar coordinates plot or the sunburst visualization) and if visualizations are made accessible in a highly interactive form. For users to make effective decisions, the underlying task needs to be supported by the visualization type. For example, the task type summarize is executed more effectively if the data presented in the visualization is already condensed by dimensions (e.g. the Sankey or the sunburst visualization), while the task type identify is easier to execute if each single data entry is presented as a single property within the visualization (e.g. the parallel coordinates plot). These results can be seen as general guidelines for Big Data visualization use in the context of managerial accounting; however, specific information on the visualization types used is also presented in this experimental study.

The remainder of this paper is structured as follows: First, the general purpose of visualizations, specific visualization types and interaction techniques are discussed and the hypotheses for our experimental design are presented. In Sect. 3, the study design is laid out in detail before analysis is presented in Sect. 4. The last sections discuss and conclude our research findings, state limitations, and propose opportunities for further research in the context of interactive visualizations for multidimensional datasets.

2 Theoretical background and hypotheses

The fundamental goal of visualizations is to generate a particular insight or to execute a specific task by emphasizing distinct features of the underlying dataset (Lurie and Mason 2007; Anderson et al. 2011). Insights can either be the discovery of trends, correlations, associations, clusters, and events (that allow the generation or the verification of hypotheses), or the presentation of information to a particular audience by telling a persuasive and data-supported story for decision-making purposes (Brehmer and Munzner 2013). While telling a story mostly follows a standardized procedure such as reporting, the generation or verification of a hypothesis, in contrast, is typically ad hoc and unstructured (Perkhofer et al. 2019b). Especially in situations that ask for an ad hoc evaluation of a highly complex and large dataset, the use of Big Data visualization is of great importance (Chen and Zhang 2014; Falschlunger et al. 2016a). Consequently, users confronted with such problems have already established the use of Big Data visualizations. Pioneering examples can be found in fraud detection (Singh and Best 2019; Keim 2001), or when analyzing network traffic (Keim et al. 2006) as well as business models to reduce costs but maintain quality (Appelbaum et al. 2017). Further, Big Data visualizations are also customary in companies with a high focus on personalized marketing and social media to evaluate the implications of certain initiatives on product satisfaction and innovation (Keim 2001; Appelbaum et al. 2017).

Novel and interactive visualization types, such as those mentioned in the introduction (if used for their intended purpose and designed optimally), allow information to be uncovered which would otherwise stay hidden (Grammel et al. 2010). Currently, these new insights can be seen as a way to better attract customers or to optimize maintenance (Appelbaum et al. 2017); however, in the near future generating insight from Big Data will be a necessity to stay competitive (Perkhofer et al. 2019b). Nonetheless, in order to generate insight, the users and their abilities as well as needs have to be considered in the process of selection and design (Perkhofer 2019; Endert et al. 2014). While targeting specific users or user groups has already been identified as an essential part in standardized or scientific visualization use (Yigitbasioglu and Velcu 2012; Speier 2006), unfortunately, researchers and developers working on Big Data visualizations still put their sole focus on the generation of new visualization options to present a holistic view on the dataset (Perkhofer 2019). In doing so, they often fail to consider the users’ precise needs and risk that their visualizations misinform users or are not used at all (Isenberg et al. 2013; van Wijk 2005; Perkhofer et al. 2019b).

In order to create or select appropriate visualizations, three stages are crucial according to Brehmer and Munzner: (1) encode (select and design appropriate visual forms), (2) manipulate (enable the user to interact with the data), and (3) introduce (enable the user to add additional data and save results) (Brehmer and Munzner 2013). In this paper, we concentrate on the selection and the design of appropriate, and most importantly interactive visualizations and therefore put emphasis only on the first two stages.

2.1 Encode: choosing the visual representation and design

Visual representation is synonymous with visual encoding and means transforming data into pictures. Analyzing data through a visual inference is easier and cognitively less demanding than looking at raw data because it allows for the identification of proximity, similarity, continuity, and closure (Zhou and Feiner 1998; Ohlert and Weißenberger 2015). Before we evaluate Big Data visualizations and their influencing factors based on usability (ISO 9241), a classification on the multiple choices proposed and presented to potential users (in literature and free libraries such as D3.js^{Footnote 1}) is necessary. We limit our investigation to frequently-cited and open-sourced visualization options (Perkhofer et al. 2019a, b) as the evaluation of all Big Data visualization options goes beyond the scope of this paper.

2.1.1 Classification and description of frequently used Big Data visualizations

For classification, we distinguish between two features: the type of data that can be represented in the proposed visualization (1a. multiple dimensions but only one attribute^{Footnote 2} → hierarchical visualization vs. 1b. multiple attributes but only one dimension → multi-attribute visualization) and the basic layout (2a. Polar or 2b. Cartesian-coordinates based visualizations). A summary of the identified visualizations is presented in Table 1 (please note that the table does not claim to be exhaustive, but should rather be seen as an indicator of frequently used or proposed visualization methods for multidimensional datasets, which is the selection criterion for our empirical analysis).

Based on this summary of highly cited and used visualization types, we can conclude that both a mix of Polar and Cartesian-coordinates based visualizations and a mix of hierarchical and multi-attribute based ones are common. From this pool of options, we picked the most frequently cited pair of each category for comparison. For a better understanding, each visualization type is explained in more detail below:

The sunburst visualization (Polar-coordinates based layout and hierarchical data structure): The sunburst visualization is one of the more frequently used visualization types compared to other and newer forms of visualizations (Perkhofer et al. 2019b). It projects the multiple dimensions of the dataset in a hierarchically dependent manner into rings and can therefore be classified as a Polar-coordinates based visualization option. The sunburst is a compact and space-filling presentation and shows the respective proportion of the total value of each dimension and its sub-components by angular size (Rodden 2014). Due to the strict structure of a sunburst, the innermost ring represents the highest hierarchical level and all dimensions dependent on it are represented in further rings to the outside (Keim et al. 2006). The position of the rings influences interpretation, and therefore re-positioning these dimensions (using another sequence of dimensions for the display of the rings) means gaining other and new valuable insights. Additionally, based on the Vega-Lite specification, categorical color scales are used to encode discrete data values, each representing a distinct category, and sequential single-hue schemes are used to visually link related data characteristics (Satyanarayan et al. 2017).
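The angular encoding described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the nested-dictionary input format, and the sales figures are assumptions made purely for illustration. Each node receives an arc whose angular size is proportional to its share of its parent's total.

```python
import math


def total(node):
    """Sum of leaf values below a node (a leaf carries its own "value")."""
    children = node.get("children")
    if not children:
        return node["value"]
    return sum(total(child) for child in children)


def sunburst_arcs(node, start=0.0, extent=2 * math.pi, depth=0):
    """Return (name, ring, start_angle, end_angle) for a node and its descendants.

    Ring 0 is the center (the whole dataset); ring 1 is the innermost ring,
    i.e. the highest hierarchical level. Assumes all values are positive.
    """
    arcs = [(node["name"], depth, start, start + extent)]
    node_total = total(node)
    offset = start
    for child in node.get("children", []):
        # Angular size is proportional to the child's share of its parent.
        child_extent = extent * total(child) / node_total
        arcs.extend(sunburst_arcs(child, offset, child_extent, depth + 1))
        offset += child_extent
    return arcs


# Hypothetical sales figures, invented for illustration only.
sales = {
    "name": "all",
    "children": [
        {"name": "Europe", "children": [
            {"name": "DE", "value": 40},
            {"name": "AT", "value": 20},
        ]},
        {"name": "Asia", "value": 40},
    ],
}

for name, ring, a0, a1 in sunburst_arcs(sales):
    share = (a1 - a0) / (2 * math.pi)
    print(f"ring {ring}: {name:<6} covers {share:.0%} of the circle")
```

Re-positioning the rings corresponds to nesting the same records in a different dimension order before calling `sunburst_arcs`, which changes which shares are visible in the inner rings.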

The Sankey visualization (Cartesian-coordinates based layout and hierarchical data structure): The Sankey chart focuses on sequences of data, which can either be time-related or dependent on a hierarchical structure (Hofer et al. 2018). It is often used to present analysis based on elections, as it allows highlighting populations remaining loyal to the same party as well as populations changing their vote from one election to the other. Thus, the Sankey visualization is designed to present information on flows between multiple dimensions (e.g. processes, entities,…) (Chou et al. 2016). With regard to storytelling and sensemaking, interactions like re-ordering (changing the sequence of dimensions) and reducing the number of visible nodes to minimize visual clutter are indispensable (Chou et al. 2016). In addition, for a consistent analysis of the data, it is necessary to find a way to highlight information across nodes by making use of selectors (Hofer et al. 2018).
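As a rough sketch of the flow idea, the following Python snippet aggregates records into weighted links between neighboring dimensions, which is the data a Sankey layout would draw as bands. The function name `sankey_links`, the record format, and the voter counts are hypothetical and chosen only to mirror the election example.

```python
from collections import Counter


def sankey_links(records, dimensions, value_key):
    """Aggregate records into flows between each pair of neighboring dimensions.

    Returns a Counter keyed by ((source_dim, source_node), (target_dim, target_node)).
    """
    links = Counter()
    for record in records:
        for src_dim, dst_dim in zip(dimensions, dimensions[1:]):
            source = (src_dim, record[src_dim])
            target = (dst_dim, record[dst_dim])
            links[(source, target)] += record[value_key]
    return links


# Hypothetical voter counts for two elections, invented for illustration only.
votes = [
    {"first": "Party A", "second": "Party A", "voters": 50},
    {"first": "Party A", "second": "Party B", "voters": 10},
    {"first": "Party B", "second": "Party B", "voters": 30},
    {"first": "Party B", "second": "Party A", "voters": 5},
    {"first": "Party A", "second": "Party A", "voters": 5},
]

links = sankey_links(votes, ["first", "second"], "voters")
for (source, target), weight in sorted(links.items()):
    kind = "loyal" if source[1] == target[1] else "switched"
    print(f"{source[1]} -> {target[1]}: {weight} ({kind})")
```

Re-ordering amounts to passing the dimension names in a different sequence; with more than two dimensions this changes which pairs are neighbors and therefore which flows are drawn.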

The parallel coordinates plot (Cartesian-coordinates based layout and multi-attribute data structure): The parallel coordinates plot is a very popular and strongly recommended visualization in the InfoVis (Information Visualization) community and highly cited in scientific research. This is because the parallel coordinates plot is one of the few visualizations that is able to present multiple attributes in one chart (Hofer et al. 2018; Perkhofer et al. 2019a). Two or more horizontal dimension axes are connected via polygonal lines at the height of the respective dimension value (Keim 2002; Perkhofer et al. 2019c). To do so, data is geometrically transformed (Keim 2001) and each line represents one data entry (e.g. an order, a sales entry). With respect to interpretation, Inselberg introduced common rules for the identification of correlations and trends (Inselberg and Dimsdale 1990):

lines that are parallel to each other suggest a positive correlation,
lines crossing in an X-shape suggest a negative correlation, and
lines crossing randomly show no particular relationship.
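These reading rules can be made concrete with a small Python sketch that counts how many pairs of lines cross between two neighboring axes. Two lines cross exactly when their order on the left axis is reversed on the right axis. The function names and the thresholds `low` and `high` are assumptions chosen for illustration, not values taken from the literature.

```python
from itertools import combinations


def normalize(values):
    """Min-max scale values to [0, 1], as each axis of the plot would."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def crossing_ratio(left, right):
    """Share of line pairs that cross between two neighboring axes."""
    if len(left) != len(right) or len(left) < 2:
        raise ValueError("need two equally long axes with at least two entries")
    a, b = normalize(left), normalize(right)
    pairs = list(combinations(range(len(a)), 2))
    # A pair crosses when its order on the left axis is reversed on the right.
    crossings = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return crossings / len(pairs)


def read_pattern(left, right, low=0.2, high=0.8):
    """Map the crossing ratio to the three reading rules (thresholds are hypothetical)."""
    ratio = crossing_ratio(left, right)
    if ratio <= low:
        return "mostly parallel: suggests a positive correlation"
    if ratio >= high:
        return "X-shaped crossings: suggests a negative correlation"
    return "mixed crossings: no particular relationship"


print(read_pattern([1, 2, 3, 4], [10, 20, 30, 40]))
print(read_pattern([1, 2, 3, 4], [40, 30, 20, 10]))
print(read_pattern([1, 2, 3, 4], [2, 1, 4, 3]))
```

The crossing ratio is closely related to rank correlation: no crossings means the two attributes rank all entries identically, while all pairs crossing means the ranking is fully reversed.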

Similar to a Sankey visualization, a user has to be able to re-arrange the axes on demand, as only neighboring axes can be interpreted in a meaningful way (Perkhofer et al. 2019a). By making use of both categorical/sequential single-hue color scales and filtering options, cluster analysis can be performed (Perkhofer et al. 2019c).

The polar coordinates plot (Polar-coordinates based layout and multi-attribute data structure): The polar coordinates plot is a radial projection of a parallel coordinates plot (Diehl et al. 2010). Attributes are arranged radially, and each attribute value is presented proportionally to its magnitude with respect to the minimum and maximum value of that attribute. Each line connecting the attribute values represents one data entry. Characteristic for a polar coordinates plot is the detection of dissimilarities and outliers. Nonetheless, it is difficult to compare the lengths across the uncommon axes (Diehl et al. 2010). Users decoding a polar coordinates plot try to interpret the area that appears as soon as all attributes are connected. Unfortunately, areas that appear at random depending on the loosely selected order of attributes misinform the user. Further, areas are more difficult to compare than straight lines connecting data points (Kim and Draper 2014), and data points in outer layers cause areas to appear disproportionately bigger; therefore, angles lead to a harder assessment than straight lines due to their distortion (Perkhofer et al. 2019a). Effects have not been tested yet (Albo et al. 2016).
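The order dependence of the enclosed area can be checked with a short Python sketch. It assumes equally spaced spokes and attribute values already scaled to [0, 1]; the helper names and the example values are hypothetical. The area is computed as the sum of the triangles spanned by neighboring spokes.

```python
import math


def scale(value, lo, hi):
    """Position of a value on its spoke, relative to the attribute's min and max."""
    return (value - lo) / (hi - lo)


def polygon_area(radii):
    """Area enclosed when the scaled values are connected on equally spaced spokes."""
    n = len(radii)
    step = 2 * math.pi / n
    # Each pair of neighboring spokes spans a triangle of area 0.5 * r1 * r2 * sin(step).
    return 0.5 * math.sin(step) * sum(
        radii[i] * radii[(i + 1) % n] for i in range(n)
    )


# One hypothetical data entry with four scaled attribute values,
# shown in two different attribute orders.
alternating = [1.0, 0.2, 1.0, 0.2]
grouped = [1.0, 1.0, 0.2, 0.2]

print(round(polygon_area(alternating), 2))
print(round(polygon_area(grouped), 2))
```

Both lists contain exactly the same values, yet the grouped order encloses a noticeably larger area than the alternating order, which is the kind of order-induced impression the paragraph above warns about.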

2.1.2 Possible factors influencing usability of Big Data visualizations

After presenting the most frequently used visualization options for Big Data, we discuss factors that may influence their ability to encode specific information and make it accessible to their users. As already explained, each visualization type has the potential to uncover and present a different type of insight to its audience (supporting a different task), while at the same time hiding another (Perkhofer 2019). As theories and experimental research on the process of encoding for interactive visualizations for Big Data are limited, research from standard business graphics and dashboarding is used for hypotheses development (Falschlunger et al. 2016a; Speier 2006; Vessey and Galletta 1991; Perkhofer 2019; Yigitbasioglu and Velcu 2012). The purpose of this approach is to test the applicability of existing principles to interactive and new forms of visualizations and to shed light on the process of encoding in order to foster decision-making.

Previous findings have shown that the following factors (explained in more detail below) influence the ability of the user to successfully decode information given a chosen visualization option (Perkhofer 2019; Falschlunger et al. 2016a, b; Ware 2012; Vessey and Galletta 1991; Speier 2006):

(1)
the design of the visualization,
(2)
the dataset,
(3)
the task, and
(4)
the decision-maker characteristics (in particular previous experience and knowledge on reading and interpreting visualizations).

With respect to the design of the visualization, it has been shown that a high data-ink ratio (Tufte 1983) and the display of coherent information in juxtaposition (Perkhofer 2019) allow for a faster processing of information. Both of these principles are satisfied by Big Data visualizations, as they are designed to visualize the full dataset within one coherent visualization. However, a need for discussion can be identified when choosing a basic layout, as visualizations are based on either a polar-coordinate or a Cartesian-coordinate system and the basic layout fundamentally changes the way information needs to be decoded by the user (Rodden 2014). While in polar-coordinates based visualizations angles need to be assessed, it is the height of a column or the length of a line that needs to be compared within a Cartesian-coordinates based system. With respect to standardized business charts, Cartesian-coordinates based visualizations (scatterplots, line and bar charts) are known to outperform polar-coordinates ones (pie charts) (Diehl et al. 2010). However, this result on the most appropriate layout needs to be re-evaluated for Big Data visualizations, as interactivity might change results (Albo et al. 2016). Further, the share of polar-based visualizations within the available and applied visualization tools is quite large and therefore deserves a closer look. This leads to our first hypotheses:

H1a: The basic layout influences usability of a visualization.
H1b: Cartesian-coordinate based visualization types outperform polar-coordinate based visualization types.

Next to the design, the underlying dataset influences usability. It is known that data can only be assessed as long as enough cognitive space is available for data processing (Sweller 2010; Atkinson and Shiffrin 1968; Miller 1956). Otherwise, or more precisely in a state of information overload, a negative effect on effectivity, efficiency, and satisfaction can be identified (Bawden and Robinson 2009; Falschlunger et al. 2016a). It has also been demonstrated that data which is presented in a familiar form (e.g. known since childhood) or which can be related to already known information stored in long-term memory is processed faster and more accurately, as the burden it poses on working memory is reduced (Perkhofer and Lehner 2019; Atkinson and Shiffrin 1968).

As presented in Table 1, one needs to distinguish between hierarchical and multi-attribute visualization types when dealing with multidimensional datasets. While for hierarchical visualizations only one attribute (e.g. one KPI such as sales) needs to be evaluated based on different levels and compositions of aggregation, for multi-attribute visualizations multiple attributes need to be processed. For the latter, not only do different measures need to be known and understood, but they also have to be analyzed in reference to each other for new insights to appear. Consequently, multi-attribute visualizations are said to increase the burden placed on the user (Falschlunger et al. 2016a), leading to the following hypotheses for our investigation:

H2a: The underlying dataset influences the usability of a visualization.
H2b: Hierarchy-based visualization types outperform multi-attribute based visualization types.

Without question and as already mentioned multiple times, tasks and insights differ with different visualization types. Matching the visualization to its respective task has been identified as a main influence in traditional visualization use. It has been shown that a mismatch increases cognitive load and consequently impairs decision-making outcome (Falschlunger et al. 2016a, b; Dilla et al. 2010; Shaft and Vessey 2006; Speier 2006; Perkhofer 2019). Up to now, the question of tables versus charts has been extensively tested resulting in a classification of spatial tasks (looking for trends, comparisons etc.) to be best supported by spatial visualizations (charts) and symbolic tasks (looking for specific data points) to be best supported by symbolic visualizations (tables) (Vessey and Galletta 1991). With respect to Big Data visualizations, a new classification of tasks has been established, namely identify (search for a specific data point), compare (compare two different data points or also two different aggregation levels), and summarize (generate overall insights by looking at the whole dataset) (Brehmer and Munzner 2013). However, these tasks have not yet been associated with visualization types or characteristics of visualization types.

Based on the fundamental activities that users have to perform and given the above presented task type classification, we hypothesize that the task identify is easier to perform if no previous aggregation based on different dimensions has influenced the visual appearance of the dataset. On the other hand, summarize should be easier to accomplish, if the dataset has already (at least to some extent) been aggregated and not every single data point is displayed in isolation. With respect to compare, results will be better for single data comparison tasks if the display shows every data-point in isolation, while results on the comparison of sub-dimensions (already aggregated data) will be better in hierarchical visualizations.

H3a: The task type influences the usability of a visualization.
H3b: Users will perform better with a multi-attribute visualization than with a hierarchy-based visualization when confronted with the task type identify.
H3c: Users will perform better with a hierarchy-based visualization than with a multi-attribute visualization when confronted with the task type summarize.
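To make the three task types more tangible, the following Python sketch expresses them as operations on a small set of records. The record format, the function signatures, and the order data are hypothetical; the sketch only illustrates why identify works on individual entries while summarize works on aggregated values.

```python
from collections import defaultdict

# Hypothetical order records, invented for illustration only.
orders = [
    {"id": 1, "region": "North", "product": "A", "sales": 120},
    {"id": 2, "region": "North", "product": "B", "sales": 80},
    {"id": 3, "region": "South", "product": "A", "sales": 200},
    {"id": 4, "region": "South", "product": "B", "sales": 50},
]


def identify(records, entry_id):
    """Identify: search for one specific data point."""
    return next(r for r in records if r["id"] == entry_id)


def compare(records, id_a, id_b, measure):
    """Compare: difference between two single data points."""
    return identify(records, id_a)[measure] - identify(records, id_b)[measure]


def summarize(records, dimension, measure):
    """Summarize: aggregate the whole dataset along one dimension."""
    totals = defaultdict(int)
    for r in records:
        totals[r[dimension]] += r[measure]
    return dict(totals)


print(identify(orders, 3))
print(compare(orders, 1, 2, "sales"))
print(summarize(orders, "region", "sales"))
```

A multi-attribute display keeps every record visible, which matches `identify` and the single-entry form of `compare`, whereas a hierarchical display effectively shows the output of `summarize` directly.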

Last but not least, the factor user characteristics needs to be considered when choosing a specific visualization type. Research on standard business charts has demonstrated that choosing the right chart type (bar, line, pie, or column) has led to contradictory results across different user groups (Falschlunger et al. 2016a). Only if the factor previous experience, not only with the dataset and the KPIs but also with the respective visualization type used, is considered and included in the selection process are satisfying results achieved (Falschlunger et al. 2016b; Perkhofer and Lehner 2019). This can again be explained by cognitive load theory: If the user has never seen the layout before, a lot of cognitive resources are needed to read the visualization rather than to interpret the information that is represented by it. The more experience a user has with a specific visualization, the more reading strategies exist and the more automated the process of extracting data from the visualization becomes, leaving ample room for data interpretation (Perkhofer 2019).

H4: Previous experience/usage of the different visualization types positively affects usability.

Based on these findings, the different visualization options for representing multidimensional data presented and described above will most likely result in differences in usability depending on the task (identify, compare, summarize), the dataset used (hierarchical or multi-attribute data), their basic layout (Cartesian-based or polar-based layout), and the level of previous experience. Nonetheless, we hope to find general rules and guidelines, similar to those of standard and scientific visualization use, to guide designers and users.

2.2 Manipulate: using interaction to manipulate existing elements

Visualizations designed to present large amounts of data greatly benefit from the process of interaction. In particular, the following processes are better supported by interactive visualizations when confronted with visual analytics tasks: detecting the expected, discovering the unexpected (generate hypotheses), and drawing data-supported conclusions (reject or verify hypothesis) (Kehrer and Hauser 2013). To be more specific, working with interactive visualizations is driven by a particular analytics task. However, when working interactively, analysis does not end by finding a proper answer to the initial task, but rather allows the generation and verification of additional and different hypotheses, which are then called insight (Pike et al. 2009). These are generated only because the user interactively works with the dataset, and the process of doing so increases engagement, opportunity, and creativity (Brehmer and Munzner 2013; Dilla et al. 2010).

As a consequence, visualizations presented in Table 1 are claimed to only become useful as soon as the user is able to interact with the data. Interaction is of such high importance because the actions of a user demonstrate the “discourse the user has with his or her information, prior knowledge, colleagues and environment” (Pike et al. 2009, p. 273). Further, the sequence of actions is not predefined but rather individual and dependent on the user. It thereby particularly supports the user’s knowledge base and perceptual abilities (Dilla et al. 2010; Brehmer and Munzner 2013; Elmqvist et al. 2011; Dörk et al. 2008; Liu et al. 2017). Consequently, interaction requires active physical and mental engagement, and throughout this process understanding is increased and decision-making capabilities are enhanced (Pike et al. 2009; Pretorius and van Wijk 2005; van Wijk 2005; Wilkinson 2005; Dix and Ellis 1998; Buja et al. 1996; Shneiderman 1996). In a static form, only a general overview is presented to the user; without the opportunity to interact, hypothesis verification and further hypothesis generation are extremely limited (Hofer et al. 2018; Liu et al. 2017; Pike et al. 2009; Perkhofer et al. 2019b). This not only frustrates users, but also contradicts the well-known mantra of visual information seeking: “overview first, zoom and filter, then details on demand” (Shneiderman 1996). On a more practical level, interaction allows the user to filter, select, navigate, arrange, or change either the amount of data or the characteristics of the visual display (for details on the interaction techniques see Table 20 in the “Appendix”).

Results on the proper use of interaction techniques are limited. Unfortunately, studies conducted up to now tend to blur the concept of visualization in combination with interaction (Pike et al. 2009). However, existing recommendations predominantly support the use of multiple interaction methods (Rodden 2014). “The more ways a user can ‘hold’ their data (by changing their form or exploring them from different angles and via different transformations), the more insight will accumulate” (Pike et al. 2009, p. 264). By intentionally clicking, scrolling, and filtering the data, the user gains a deeper understanding of the relations within the given dataset. Interaction is therefore an essential part of the sense-making process and enhances the user’s processing and sense-making capabilities (Shneiderman 1996). Building on previous literature, the following hypotheses are presented:

H5a: Interaction influences the usability of a visualization.
H5b: Users will perform better with a highly interactive visualization than with a mostly static one.

3 Study design

The purpose of this paper was to identify an interactive visualization that presents multidimensional data effectively and efficiently. Therefore, a within- and between-subjects experimental design (4 × 3 × 2) was used. The visualization type was manipulated at four levels: for the experiment, we chose the most frequently researched and available visualization types, namely the sunburst visualization, the Sankey visualization, the parallel coordinates plot, and the polar coordinates plot (please see Table 1). Two of the visualization types under investigation use a Polar-based layout and two a Cartesian-based one. Further, two out of four show a hierarchical dataset while the other two present a multi-attribute one. The task type was manipulated at three levels: the statements were based on Brehmer and Munzner’s task taxonomy (identify, compare, and summarize) (Brehmer and Munzner 2013). And finally, interaction was manipulated at two levels (limited interaction, high interaction).
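As a minimal sketch, the 4 × 3 × 2 factorial structure described above can be enumerated as follows. The factor levels are taken from the text; the mapping of each visualization type to a layout and a data structure is an assumption inferred from how these chart types are usually drawn (Table 1 remains the authoritative source), and all identifiers are hypothetical.

```python
from itertools import product

# Factor levels as named in the study design.
VISUALIZATIONS = ["sunburst", "sankey", "parallel_coordinates", "polar_coordinates"]
TASKS = ["identify", "compare", "summarize"]
INTERACTION = ["limited", "high"]

# ASSUMPTION: layout and data structure per visualization type, inferred from
# the usual form of these charts; the text only states a 2 + 2 split for each.
VIS_FEATURES = {
    "sunburst": {"layout": "polar", "data": "hierarchical"},
    "sankey": {"layout": "cartesian", "data": "hierarchical"},
    "parallel_coordinates": {"layout": "cartesian", "data": "multi-attribute"},
    "polar_coordinates": {"layout": "polar", "data": "multi-attribute"},
}


def design_cells():
    """Return one dict per cell of the visualization x task x interaction design."""
    return [
        {"visualization": v, "task": t, "interaction": i, **VIS_FEATURES[v]}
        for v, t, i in product(VISUALIZATIONS, TASKS, INTERACTION)
    ]


if __name__ == "__main__":
    cells = design_cells()
    print(len(cells), "design cells")
    print(cells[0])
```

Under the assumed mapping, layout and data structure are fully crossed across the four visualization types, so each of the two features splits the design cells evenly.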

The experimental study was conducted using LimeSurvey and the crowdsourcing platform Amazon Mechanical Turk (MTurk). For each visualization type, a separate but identical experiment was created. Each participant evaluated only one visualization type, but had to assess various statements to simulate the process of hypothesis verification (= task types). Visualizations were coded based on the D3.js library, extended, and adjusted to fit our purpose (the most significant changes concerned the interaction techniques, as the available visualizations offered only limited options). Visualizations are available for download on the author’s homepage or they can be accessed by clicking on the link presented in Table 22. Specifications on the dataset, the tasks, and the visualizations used are presented in the following subsections, and the research model is presented in Fig. 1.

3.1 Data sample

We used a self-generated data sample for our study as a basis to compare the different visualization types. The dataset simulated a wine trade company and consisted of 9961 records, whereby each record represented a customer’s order. During construction of the sample, six finance experts designed key metrics typically used in trade companies to simulate a close-to-reality example for data exploration. The dataset consisted of 14 dimensions (order number, trader, grape variety, winemaker, state, etc.) and 12 attributes (gross margin, net margin, gross sales, net sales, discounts, gross profit, shipping costs, etc.) in total. As a result, our dataset can be described as being structured and shows no inconsistencies or missing values. Users were confronted with a large amount of data shown within one visualization, including multiple possible dimensions and attributes in order to find patterns, trends, or outliers. This allowed the assumption that confusion and misunderstanding based on the dataset were kept to a minimum (also confirmed in pre-tests). Each visualization used, without any filters active, showed the complete underlying dataset of 9961 records.

3.2 Manipulation of the independent variables

As already explained in Sect. 2, we tested four distinct visualization types. These four visualization types can be characterized by two features: the structure of the data they are capable of displaying (hierarchical data vs. multi-attribute data) and their overall layout (horizontal/Cartesian vs. radial/Polar). Additionally, interaction is the central component for understanding and working with Big Data visualization tools. Therefore, by taking a closer look at existing prototypes and literature, two interaction concepts per visualization type were designed to establish comparability and fairness, but also to present each type in the best possible and most natural way. The visualization types used, which are available online, and their respective interaction concepts are presented in Tables 21 and 22 in the “Appendix”.

Based on the previously described dataset, statements in accordance with Brehmer and Munzner’s task classification model for Big Data visualizations were created and presented to the participants for evaluation in randomized order. Participants were asked in the experimental conditions to assess the statements’ truth (examples are presented in Table 2). Each task type was assessed twice per visualization-interaction combination.
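The statement schedule described above can be illustrated with a short sketch. It assumes that both interaction levels are seen by the same participant and that all statements are shuffled together; the text does not state either point explicitly, so a blocked order per interaction level would be equally plausible. Function and field names are hypothetical.

```python
import random

TASKS = ["identify", "compare", "summarize"]
INTERACTION = ["limited", "high"]
REPEATS = 2  # each task type is assessed twice per visualization-interaction combination


def statement_schedule(visualization, seed=None):
    """Build a randomized list of statement slots for one participant.

    ASSUMPTION: interaction is varied within participants and the order is
    randomized across all slots, not within interaction blocks.
    """
    rng = random.Random(seed)  # local generator keeps the order reproducible per seed
    schedule = [
        {"visualization": visualization, "interaction": i, "task": t, "repeat": r}
        for i in INTERACTION
        for t in TASKS
        for r in range(1, REPEATS + 1)
    ]
    rng.shuffle(schedule)
    return schedule


if __name__ == "__main__":
    for slot in statement_schedule("sunburst", seed=1):
        print(slot)
```

Under these assumptions a participant assigned to one visualization type would assess 2 × 3 × 2 = 12 statements.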

Table 2 Task types used including their predefined answer options
