Spatial legend compatibility within versus between graphs in multiple graph comprehension

Riechelmann, Eva; Huestegge, Lynn

doi:10.3758/s13414-018-1484-0


Published: 01 February 2018

Volume 80, pages 1011–1022, (2018)

Attention, Perception, & Psychophysics




A Correction to this article was published on 19 March 2018

This article has been updated

Abstract

Previous research has shown that spatial compatibility between the data region and the legend of a graph is beneficial for comprehension. However, in multiple graphs, data–legend compatibility can come at the cost of spatial between-graph legend incompatibility. Here we aimed at determining which type of compatibility is more important for performance: global (legend–legend) compatibility between graphs, or local (data–legend) compatibility within graphs. Additionally, a baseline condition (incompatible) was included. Participants chose one out of several line graphs from a multiple panel as the answer to a data-related question. Compatibility type and the number of graphs per panel were varied. Whereas Experiment 1 involved simple graphs with only two lines/legend entries within each graph, Experiment 2 explored more complex graphs. The results indicated that compatibility speeds up comprehension, at least when a certain threshold of graph complexity is exceeded. Furthermore, we found evidence for an advantage of local (data–legend) over global (legend–legend) compatibility under specific conditions. Taken together, the results further support the idea that compatibility principles strongly determine the ease of integration processes in graph comprehension and should thus be considered in multiple-panel design.


Introduction

The omnipresence of graphs

Graphs have become omnipresent across a wide range of contexts in our everyday life (Purchase, 2014; Shah, Freedman, & Vekiri, 2005; Zacks, Levy, Tversky, & Schiano, 2002). Especially in the scientific world, they are a key ingredient for disseminating statements in a compact and yet powerful way. If well designed, graphs are a convenient way to communicate data, with many advantages over textual presentation (Larkin & Simon, 1987). These advantages include portraying complex data and relationships in an easy and understandable way, reducing reading time by presenting key findings in a readily visible manner, and reducing the overall word count (Franzblau & Chung, 2012). Given the omnipresence and potential effectiveness of graph usage, it is unfortunate that some graphs do not live up to their potential due to poor construction.

Early graph design guidelines often relied on common sense, positing plausible principles without strong empirical evidence (Bertin, 1983; Schmid & Schmid, 1979; Tufte, 1983). Over the years, however, graph comprehension theories have been backed up by empirical data (e.g., Carpenter & Shah, 1998; Cleveland & McGill, 1984; Pinker, 1990), and subsequent research has empirically addressed specific aspects of graph design. This resulted in various empirically informed design guidelines (Franzblau & Chung, 2012; Hollands & Spence, 1998; Kosslyn, 1994, 2006; Kumar & Benbasat, 2004; Shah & Carpenter, 1995; Shah & Hoeffner, 2002; Wickens, Hollands, Banbury, & Parasuraman, 2013), including guidelines for specific scientific disciplines (see, e.g., American Psychological Association, 2010, for psychological research). Note that these guidelines have mainly focused on the design of single graphs.

One example of a powerful display design principle is the well-known proximity compatibility principle (PCP) (Wickens & Carswell, 1995), which states that similarity (perceptual proximity) of graph elements fosters integration processes (processing proximity). The concept of similarity can refer to perceptual attributes (e.g., color and texture) as well as to absolute spatial position (Gillan, Wickens, Hollands, & Carswell, 1998; Wickens et al., 2013). The principle of compatibility is a concept related to the PCP. It is a common psychological principle with a long research tradition (e.g., Proctor & Vu, 2006), which is often referred to when designing effective graphs (Kosslyn, 2006) in terms of its benefits for performance—reflected, for example, in decreased error rates and/or response times (e.g., Hommel & Prinz, 1997; Huestegge & Philipp, 2011). Compatibility, as we refer to it in the present study, describes the degree to which the components of different elements in a display (i.e., stimulus–stimulus [S–S] compatibility) are spatially interrelated (Fitts & Simon, 1952; Proctor & Vu, 2006).

Huestegge and Philipp (2011) addressed S–S compatibility in graph comprehension by investigating the influence of spatially compatible versus incompatible data–legend relations. When participants judged whether a graph corresponded to a previously displayed statement, comprehension was facilitated for spatially compatible (vs. incompatible) data–legend relations, in terms of faster response times (RTs) and higher accuracy. For example, participants were faster when reading a graph in which the upper data line was black and the upper legend entry also referred to this black line. Furthermore, the compatibility effect scaled up with increasing complexity, in terms of both data pattern complexity (i.e., stimuli depicting interactions instead of main effects) and visual graph complexity (i.e., high vs. low amounts of data depicted within a graph). With this finding, the authors extended the original claim of the PCP, since they demonstrated that display proximity can also refer to relative spatial proximity, in terms of the relative position of elements in the legend and data regions. This finding is informative for theories of graph comprehension.

Theories of graph comprehension

Among others (e.g., Cleveland & McGill, 1984; Kosslyn, 1989; Lohse, 1993; Pinker, 1990; Simkin & Hastie, 1987), Carpenter and Shah (1998) introduced an influential, empirically informed model of graph comprehension. They proposed a multicycle, three-stage processing model. Every cycle starts with a pattern recognition phase, devoted to the encoding of visual patterns by forming visual chunks. An interpretation phase then involves retrieving and constructing qualitative and quantitative meaning from the chunk (e.g., associating an ascending line with increase), and finally, an integration phase relates these meanings to the semantic referents inferred from legend, labels, and titles. Thus, to facilitate the process of information integration, the comprehensibility of the legend (and/or label and title) should be maximized. The model’s assumption of a multicycle process was empirically supported by corresponding eye fixation patterns (i.e., frequent gaze transitions between elements of the graph). The importance of information integration processes for graph comprehension has been further supported by other studies (e.g., Ratwani, Trafton, & Boehm-Davis, 2008). Huestegge and Philipp (2011) specifically addressed the integration of elements of the data region and the legend by manipulating the spatial compatibility between these elements (spatially compatible vs. incompatible data–legend relations). Corresponding eyetracking data showed a decrease of gaze transitions between the data region and the legend in data–legend-compatible conditions, suggesting that data–legend compatibility facilitates integration processes in graph comprehension.

However, there is a substantial lack of empirically backed knowledge regarding the design of multiple panels. Multiple panels are widely used in all fields of science and refer to the combined presentation of several graphs showing (closely) related, yet different, sets of data (Wickens et al., 2013). Many research guidelines lack any recommendations for the design of multiple panels (e.g., American Psychological Association, 2010), or mention this issue only vaguely (e.g., Coghill & Garson, 2006). Given the obvious advantages and widespread use of multiple panels (Kosslyn, 2006), this is highly surprising. Hence, in the present study we aimed to focus on one specific open issue, namely spatial legend compatibility, that we consider relevant to maximizing the efficiency of information integration processes in multiple-panel graphs.

The present study

The central starting point of the present study was the consideration of two different, but equally plausible, design options for multiple-panel line graphs, which differ in whether compatibility is achieved within graphs (between elements in the data and legend regions) or between graphs (between legends). First, optimizing each individual graph of the multiple panels (along the lines of Huestegge & Philipp, 2011), and thus applying the principle of data–legend compatibility (i.e., within-graph compatibility), would lead to local optimality, whereas global between-graph legend incompatibility might occur, given sufficient variability in the data presented (see Fig. 1a). Second, it is possible to achieve global (i.e., between-graph) legend compatibility, but at the cost of potential local data–legend incompatibility in several graphs of the multiple panels (see Fig. 1b). Whereas Kosslyn (2006) argued in favor of the concept of within-graph compatibility, Andre and Wickens (1992) considered global compatibility (albeit in the context of human–machine interface design) an important design feature. The goal of the present study was to put the competing compatibility principles to an empirical test by comparing the effects of both kinds of compatibility (pitted against an incompatible baseline condition) on graph comprehension processes. Specifically, spatial compatibility was manipulated by changing the order of the legend entries relative to the data region.

We also addressed the issue of panel size (number of graphs in a panel: two vs. six) in the present experiments, because several studies have shown that spatial compatibility effects scale up with visual complexity (Huestegge & Philipp, 2011; Ratwani et al., 2008). These previous findings indicated that at a certain threshold of complexity, the likelihood of observing adverse effects of incompatibility is increased (Carpenter & Shah, 1998; Ratwani et al., 2008), probably due to the limits of working memory capacity.

Furthermore, the presentation order of the compatibility conditions (blocked vs. random) was varied for two reasons. First, it appears reasonable to assume that compatibility effects could become more pronounced when trials are presented in blocks as compared to a random presentation order, since in blocked conditions participants might learn to take advantage of the particular type of compatibility over the course of a block. Second, the blocked-design condition allows for a separate analysis of initial performance (i.e., performance in the first block of a particular compatibility condition, without having experienced any other compatibility conditions) versus total performance averaged across the experiment. We considered such an analysis of initial performance relevant because it best represents the rather spontaneous encounter with a single (nonchanging) type of graph design in everyday life. In contrast, we anticipated that the repeated processing of graphs with varying types of legend arrangements (in the random sequence, or across all blocks in the blocked sequence) might yield a special processing strategy in order to cope with all types of legend arrangements encountered throughout the experiment, eventually yielding potentially diluted effects of compatibility.

On the basis of the aforementioned research indicating that graph complexity plays a major role regarding the presence/size of spatial compatibility effects in graph comprehension, we additionally ran Experiment 2, in which the complexity within each graph was increased by using line graphs with four (vs. two in Exp. 1) lines/legend entries per graph. Thus, we examined visual complexity in our study in two ways: namely, regarding panel complexity (within each experiment) and regarding graph complexity (across experiments).

Taken together, we predicted that both between- and within-graph compatibility would facilitate graph processing, and that such compatibility effects should scale up with visual complexity. Thus, we expected to find more substantial evidence for compatibility effects in Experiment 2 than in Experiment 1, and a stronger compatibility effect for larger panels (containing six graphs) than for smaller panels (containing two graphs). Regarding the two types of compatibility, we reasoned that especially for larger panels, in which working memory limits (Baddeley, 1983; Baddeley & Hitch, 1974) should strongly constrain any integrated or parallel processing of multiple graphs due to the greater number of graphs to be processed, within-graph compatibility should yield better performance than between-graph legend compatibility. In within-graph-compatible arrangements, we assumed that graph readers would automatically generate expectations regarding the (spatial) data region layout when they encoded the legend, and that meeting these expectations might promote integration processes (Huestegge & Philipp, 2011).

Experiment 1

In Experiment 1, we examined legend compatibility in simple multiple panels (i.e., panels in which each graph consisted of only two lines/legend entries). Every graph within the multiple panels consisted of two parts: the data region and the legend, which was placed at the center right of the data region. To manipulate compatibility, we varied the order of the legend entries. In within-graph-compatible multiple panels, the order of the graph lines corresponded to the order of the legend entries in each graph of the multiple panels. Between-graph compatibility was obtained by maintaining a constant legend order for every graph of the multiple panels. In incompatible multiple panels, there was no spatial match, neither between data region and legend nor among the legends of different graphs.
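The two compatibility definitions above can be sketched as simple checks on legend order. This is only an illustration: the representation of a graph as a pair of top-to-bottom orderings, the function names, and the example panels are assumptions made here and are not the authors' stimulus-generation procedure.

```python
from typing import List, Sequence, Tuple

# A graph is described by two top-to-bottom orderings of marker labels:
# the order of the lines in the data region and the order of the legend entries.
# (Hypothetical representation, chosen for this sketch.)
Graph = Tuple[Sequence[str], Sequence[str]]


def within_graph_compatible(panel: List[Graph]) -> bool:
    """True if, in every graph, the legend order matches the data line order."""
    return all(tuple(data) == tuple(legend) for data, legend in panel)


def between_graph_compatible(panel: List[Graph]) -> bool:
    """True if all graphs in the panel share one and the same legend order."""
    legend_orders = {tuple(legend) for _, legend in panel}
    return len(legend_orders) == 1


# Illustrative two-graph panels (marker labels are made up for the example).
local_panel = [(("black", "white"), ("black", "white")),
               (("white", "black"), ("white", "black"))]
global_panel = [(("black", "white"), ("black", "white")),
                (("white", "black"), ("black", "white"))]
incompatible_panel = [(("black", "white"), ("white", "black")),
                      (("white", "black"), ("black", "white"))]

for name, panel in [("local", local_panel),
                    ("global", global_panel),
                    ("incompatible", incompatible_panel)]:
    print(name, within_graph_compatible(panel), between_graph_compatible(panel))
```

Note that in the illustrative global panel the first graph is also locally compatible, so global compatibility here means the absence of consistent local compatibility rather than the absence of any local match.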

Method

Participants

Twenty-five university students (20 women, five men; age range: 20–29 years; M = 23.25, SE = 0.51) participated in the experiment and received credit points. One participant was excluded due to low accuracy (see the Results and discussion section for details). All of the remaining 24 participants reported normal or corrected-to-normal vision and had basic prior experience with statistics (e.g., due to statistics classes, work as a research assistant, or in the context of writing an empirical thesis). They gave informed consent. A power analysis using G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), based on the very large observed effect sizes (regarding compatibility effects on RTs in line graphs) in the study of Huestegge and Philipp (2011), indicated that a sample size of four participants was sufficient to observe a spatial compatibility effect (power = .95, α = .05). Nevertheless, we opted for a larger sample size of n = 12 for each group, since we could not be sure that the compatibility effects in the present study would be as large as those in Huestegge and Philipp’s study.

Stimuli

Each trial consisted of the simultaneous presentation of several graphs (generated with Microsoft Excel), together forming multiple panels (consisting of either two or six graphs of equal size; see Fig. 2). There was a horizontal distance of 3.1° of visual angle between two graphs and a vertical distance of 0.9° between each of the three graphs in the six-graph panel.

The size of each graph amounted to 9.2° × 8.1° of visual angle (width × height). All graphs were black-and-white line graphs consisting of two uncrossed black lines each. The graphs depicted main effects and/or interactions, with both types of effects being represented within each of the multiple panels. The data point markers were black or white circles. Each legend (1.8°–2.5° × 1.5° of visual angle, depending on the legend’s content) was placed to the right of the data region (5.5°–6.3° × 5.7°, depending on the legend’s size) and contained two entries, each consisting of a data marker (black/white circle) and a verbal label. The spatial separation between legend and data amounted to 0.3°, and the title and the data region were separated by 0.6° (see Fig. 1a).

To increase generalizability, we generated graphs (identical in design) covering three different topics (fictional dependent variables). These variables were represented on the y-axis, namely screen viewing time, life satisfaction, and learning outcome. These measures were plotted as a function of three independent variables. First, the x-axis referred to a dichotomous variable that was loosely related to age, thereby following the recommendation of Wickens et al. (2013) to place a (quasi-)quantitative variable on the x-axis of line graphs. The two lines in the graph represented the second independent variable, defined through the legend. The third variable was also categorical and was presented above the data region, thus also serving as the graph title (see Fig. 1a).

We designed two basic multiple-panel figures for each of the three topics: a two-graph multiple-panel figure and a six-graph multiple-panel figure. These six basic multiple panels served as templates, and each multiple panel was designed with both spatially compatible (within-graph- and between-graph-compatible) and incompatible legends, resulting in 18 multiple panels. In between-graph-compatible multiple panels, the order of the black and white data point markers in the legend (e.g., black markers in upper position/white markers in lower position) was counterbalanced across trials.

Note that due to the restriction of depicting only two lines (i.e., graphs with two legend entries), it was not possible to create between-graph-compatible panels that did not involve some individual graphs with compatible data–legend arrangements (here, half of the graphs per panel). For example, when using a constant “black above white marker” legend, half of the graphs in the panel also contained “black above white line” data (otherwise, all data regions would have been arranged in the same manner, which would be unlikely in real data sets as well as an uninteresting case for the present research question). Thus, global (between-graph) compatibility here was characterized by the absence of consistent local (within-graph) compatibility, not by the absence of any local compatibility.

Each trial consisted of the simultaneous presentation of a question (white font on black background), extending over seven to ten text lines (10° horizontally), and the multiple panels (see Fig. 1a for an example). For every question, there was a single correct answer, which corresponded to (the title of) one specific graph of the multiple panels (e.g., “For what kind of learning support is the learning outcome for the subject mathematics higher than that for the subject English?”—correct answer: “extra tuition” for two-graph multiple panels; “For what kind of learning support in high school is the learning outcome for the subject English most superior to the learning outcome for the subject mathematics?”—correct answer: “tutoring in small groups” for six-graph multiple panels). Half of the questions (for both two-graph and six-graph multiple panels) were designed to ask for main effects, and the other half to ask for simple main effects (considering the two temporal categories of the x-axis each as reference points). The questions always relied on vertical spatial terms (e.g., higher/lower) to indicate the task. We generated four questions for each of the three topics (screen viewing time, life satisfaction, and learning outcome) and for each kind of multiple panel (two-graph and six-graph multiple panels), resulting in 24 questions in total. Each of the 24 questions was combined with the corresponding graph (regarding the topic and number of graphs within the multiple panels) in its three different compatibility condition versions, resulting in 72 experimental trials in total.
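The stimulus counts reported above (six templates, 18 multiple panels, 24 questions, 72 experimental trials) follow from crossing the design factors. The short sketch below reproduces this arithmetic; the variable names and the question indices are placeholders introduced here, not the authors' materials.

```python
from itertools import product

topics = ["screen viewing time", "life satisfaction", "learning outcome"]
panel_sizes = [2, 6]  # number of graphs per multiple panel
compatibility = ["within-graph", "between-graph", "incompatible"]
questions_per_template = 4  # four questions per topic and panel size

# One template per topic x panel size.
templates = list(product(topics, panel_sizes))

# Each template exists in three compatibility versions.
panels = list(product(templates, compatibility))

# Placeholder question indices 1..4 for every template.
questions = [(topic, size, q)
             for topic, size in templates
             for q in range(1, questions_per_template + 1)]

# Each question is paired with its panel in all three compatibility versions.
trials = list(product(questions, compatibility))

print(len(templates), len(panels), len(questions), len(trials))  # 6 18 24 72
```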

Apparatus, task, and procedure

The text and figures were presented centered on a 19-in. TFT screen (1,280 × 1,024 pixels) at a viewing distance of approximately 57 cm. A standard keyboard and a computer mouse were available as input devices for the participants. The experiment was run using the PsychoPy presentation software (Peirce, 2007).
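For readers who want to relate the reported visual angles to physical sizes on the screen, the standard conversion is size = 2 · d · tan(θ/2), where d is the viewing distance. The sketch below applies it to the reported graph size at the approximate viewing distance of 57 cm; the function name is hypothetical, and the results are only as precise as the "approximately 57 cm" in the text.

```python
import math

VIEWING_DISTANCE_CM = 57.0  # approximate viewing distance reported in the text


def angle_to_cm(angle_deg: float, distance_cm: float = VIEWING_DISTANCE_CM) -> float:
    """Physical extent (cm) subtending angle_deg at the given viewing distance."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


# Reported size of a single graph: 9.2 deg x 8.1 deg (width x height).
graph_width_cm = angle_to_cm(9.2)
graph_height_cm = angle_to_cm(8.1)

print(f"graph: {graph_width_cm:.1f} cm x {graph_height_cm:.1f} cm")
print(f"1 deg corresponds to about {angle_to_cm(1.0):.2f} cm")
```

At this distance, 1° of visual angle corresponds to roughly 1 cm on the screen, which is why 57 cm is a convenient viewing distance.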

Before the single-session experiment (about 30 min) started, participants read written instructions (white font on a black background) and completed four practice trials to familiarize themselves with the task. Each trial started with a white fixation cross (0.5° × 0.5°) on a black screen, presented for 1 s and placed on the left side of the screen (see Fig. 3). After a 2-s black screen interval, the question and the multiple panel were presented simultaneously, with the question located at the position of the prior fixation cross and the multiple panel on the right side of the screen. With the onset of question and multiple panel, the mouse cursor appeared at the center of the multiple panel to ensure equal starting positions for each trial. Participants were asked to indicate (with a left mouse click) as quickly and accurately as possible the specific graph representing the correct answer. No time limit was imposed on the answer. Each trial contained only one correct option to answer the question. After each click, performance feedback was provided for 1 s. The assignment of the correct answer to a position within the multiple panel was equally distributed across trials. For half of the participants, the trials were presented in a fully randomized order. For the other half, compatibility type (within-graph compatibility, between-graph compatibility, incompatible) was manipulated block-wise (six blocks altogether), with the block order being fully counterbalanced across participants.
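One way to picture full counterbalancing of three compatibility conditions is to enumerate all 3! = 6 possible orders and distribute them evenly over the 12 participants of the blocked group. The sketch below does this under stated assumptions: the text does not specify how the six blocks map onto the three conditions, and the participant numbering and the round-robin assignment are hypothetical.

```python
from itertools import permutations

conditions = ["within-graph", "between-graph", "incompatible"]
orders = list(permutations(conditions))  # all 3! = 6 possible condition orders

n_blocked = 12  # half of the 24 analyzed participants (blocked group)

# Hypothetical round-robin assignment: each order is used by two participants.
assignment = {pid: orders[(pid - 1) % len(orders)]
              for pid in range(1, n_blocked + 1)}

# Number of participants who start with each condition.
first_block_counts = {
    cond: sum(1 for order in assignment.values() if order[0] == cond)
    for cond in conditions
}

print(len(orders), first_block_counts)
```

Under this reading, four participants start with each condition, which would be consistent with the error degrees of freedom (12 − 3 = 9) reported for the between-subjects initial performance analysis.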

Design and data analysis

Compatibility (incompatibility vs. between-graph compatibility vs. within-graph compatibility, within-subjects variable), number of graphs (two vs. six graphs within the multiple panels, within-subjects variable), and context (random vs. blocked presentation of compatibility conditions, between-subjects variable) served as independent variables. Corresponding three-way mixed analyses of variance (ANOVAs; α = .05 throughout) were conducted to analyze performance (RTs and error rates). Additionally, we assessed initial performance on the first block of trials in the blocked-presentation group (see the introduction for details) with two-way ANOVAs, treating compatibility as a between-subjects variable. In the case of sphericity violations, Greenhouse–Geisser corrections were applied.

Results and discussion

Outliers were calculated on the basis of correct trials and defined as trials with exceedingly long RTs (three SDs above the mean per participant, corresponding to 20 trials in total across all participants). These outliers, together with practice trials, were omitted from further analysis. Additionally, participants with very high error rates (three SDs above the mean, amounting to a cutoff value of 17.41%) or who lacked above-chance performance in at least one cell of the design were excluded (corresponding to one participant). In both the initial performance analysis and the global performance analysis, participants on average performed significantly better (in terms of faster RTs and fewer errors) when the multiple panels consisted of two (vs. six) graphs, F(1, 9) = 43.27, p < .001, η_p² = .83, and F(1, 9) = 36.00, p < .001, η_p² = .80, respectively, for RTs and errors in the initial performance analysis, and F(1, 22) = 276.56, p < .001, η_p² = .93, and F(1, 22) = 101.74, p < .001, η_p² = .82, respectively, for RTs and errors in the global performance analysis.
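The RT trimming rule described above (dropping correct trials more than three SDs above a participant's mean) can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name and the example values are made up, and the use of the sample SD (rather than the population SD) is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev
from typing import Dict, List


def trim_slow_trials(
    correct_rts: Dict[str, List[float]], n_sd: float = 3.0
) -> Dict[str, List[float]]:
    """Drop correct-trial RTs more than n_sd SDs above each participant's mean."""
    trimmed: Dict[str, List[float]] = {}
    for participant, rts in correct_rts.items():
        if len(rts) < 2:
            # An SD cannot be estimated from a single trial; keep it as is.
            trimmed[participant] = list(rts)
            continue
        cutoff = mean(rts) + n_sd * stdev(rts)  # sample SD (assumption)
        trimmed[participant] = [rt for rt in rts if rt <= cutoff]
    return trimmed


# Made-up RTs in seconds for one hypothetical participant.
example = {"p01": [10.0] * 19 + [60.0]}
cleaned = trim_slow_trials(example)
print(len(example["p01"]), "->", len(cleaned["p01"]))
```

Because the cutoff is computed per participant, generally slow participants are not penalized relative to fast ones.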

Initial performance analysis

The two-way ANOVA with number of graphs as a within-subjects variable and compatibility as a group variable revealed no significant effect of compatibility on either RTs, F(2, 9) = 2.60, p = .128, η_p² = .37, or error rates, F < 1. For both RTs and error rates, the number of graphs did not significantly interact with compatibility, both Fs < 1.
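As a reading aid, partial eta squared can be recovered from a reported F value and its degrees of freedom via η_p² = (F · df_effect) / (F · df_effect + df_error). The sketch below applies this identity to some of the values reported in this section; the function name is introduced here for illustration.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Effect of number of graphs on RTs, initial performance: F(1, 9) = 43.27
print(round(partial_eta_squared(43.27, 1, 9), 2))

# Effect of compatibility on RTs, initial performance: F(2, 9) = 2.60
print(round(partial_eta_squared(2.60, 2, 9), 2))

# Effect of number of graphs on RTs, global performance: F(1, 22) = 276.56
print(round(partial_eta_squared(276.56, 1, 22), 2))
```

The identity holds because F is the ratio of the effect and error mean squares, so F · df_effect / df_error equals the ratio of the corresponding sums of squares.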

Global performance analysis

RTs were submitted to a three-way ANOVA with compatibility and number of graphs as within-subjects variables, and context as group variable. We observed no significant main effect of compatibility, F(2, 44) = 1.29, p = .285, η_p² = .06, and no significant main effect of context, F < 1 (see Fig. 4a). Furthermore, none of the two-way interactions were significant, all Fs < 1. However, the three-way interaction was significant, F(2, 44) = 3.49, p = .039, η_p² = .14, indicating that compatibility had a different effect in the blocked (vs. the random) context, especially in the six-graph condition. Nevertheless, when we conducted pairwise post-hoc t tests between the compatibility conditions, none of the comparisons approached significance, all ps > .10.

The mean overall error rate amounted to 9.20% (SE = 0.58). The main effect of compatibility on error rates was not significant, F < 1 (see Fig. 4b). Neither the interaction of compatibility and context nor that of number of graphs and context was significant (both Fs < 1), nor was the three-way interaction, F(2, 44) = 1.18, p = .316, η_p² = .07. Only the interaction of compatibility and number of graphs was marginally significant, F(2, 44) = 2.97, p = .062, η_p² = .09. We decided to follow up on this marginal interaction by computing pairwise comparisons between the compatibility conditions. In the two-graph condition, the within-graph-compatible condition differed significantly from the incompatible condition, p = .009, whereas none of the remaining comparisons (including those in the six-graph condition) approached significance, all ps > .10. These results suggest that (specifically within-graph) compatibility tended to reduce error rates in the two-graph condition, but there clearly was no such tendency in the six-graph condition.

In sum, Experiment 1 revealed the expected result that graph comprehension is quicker and more accurate when the number of graphs depicted in the multiple panels is low. However, there was only sparse evidence (in terms of fewer errors for within-graph-compatible designs in the two-graph condition) for a beneficial effect of spatial legend compatibility.

This lack of a clear performance advantage for compatible relative to incompatible graphs in Experiment 1 can probably be attributed to the fact that the individual graphs within each panel were all rather simple, consisting of only two lines and legend entries. A reason for not finding performance differences between the two compatibility conditions may be that in the between-graph-compatible condition, half of the individual graphs per panel were still data–legend compatible (due to the restriction of depicting only two lines; see the Method section), which may have reduced the actual difference in design between the two compatibility options.

The explanation that the lack of clear effects could have been due to the lack of individual graph complexity corresponds to previous findings showing strongly attenuated (or absent) data–legend compatibility effects in single-graph panels with simple (two-line) graphs, whereas much stronger effects emerged for more complex graphs consisting of more than two lines (Huestegge & Philipp, 2011). To explicitly test this explanation, we conducted Experiment 2, which involved more complex graphs consisting of four (instead of two) lines per graph.

Experiment 2

In Experiment 2, we focused on visually complex graphs (i.e., graphs depicting more data) by raising the number of lines and legend entries for each graph from two to four. On the basis of previous research suggesting that legend compatibility effects scale up with graph complexity, we reasoned that we should observe clearer compatibility effects in Experiment 2 than in Experiment 1, as well as stronger compatibility effects for large (vs. small) panels.