1 Introduction

Information presentations (e.g., presentation slides, business reports, and advertising posters) are widely used in daily life as an essential medium of visual communication. In an information presentation, the layout refers to the 2D spatial structure determined by the sizes and positions of the visual elements [1]. To ensure effective message delivery, the layout is usually arranged according to a set of design guidelines or principles. For example, Robin Williams provided practical design guidelines (e.g., proximity, alignment, repetition, and contrast) for arranging the layout when creating an information presentation [2]. More theoretically, the Gestalt laws provide the fundamental theory explaining how human beings perceive and understand visual information [3]. In short, a well-designed layout structure enables quick and easy perception of the information [4, 5].

Hierarchical grouping is a popular approach to explaining the perceptual structure of the layout, and it is commonly used in professional graphic design tools (see Fig. 1 for an example). Here, a group is defined as a set of related visual elements that are associated with each other, and a hierarchical grouping is a collection of groups organized in a hierarchical structure. A proper hierarchical grouping can benefit many downstream use cases in visual design. A primary and direct use case is making it easier for users to design information presentations: if the visual elements of an information presentation are grouped into a meaningful hierarchical structure, users can conveniently move/resize/rotate/transform/delete a group of related elements. It can also help to apply animation effects to the elements [6]. Furthermore, other potential use cases that rely on understanding the grouping structure include design assessment [7], visual search [8], information extraction [9], information seeking [10], and design space exploration [11].

Figure 1

An example showing the grouping of an information presentation at two hierarchy levels. (a) Original design; (b) Large grouping granularity; (c) Smaller groups; (d) The pane for viewing the hierarchical grouping in PowerPoint. High-level groups consist of low-level groups, which in turn consist of visual elements. In each grouping hierarchy, visual elements in the same red box are in the same group

Although useful for downstream applications, most information presentations in the wild do not explicitly annotate the hierarchical grouping. To address this problem, we aim to automatically recover hierarchical grouping from layouts of visual elements in information presentations. This is a challenging task for two major reasons. First, unlike the spatial relationships between elements in the layout, such as direction and distance, which can be computed using pre-defined rules [12], the grouping structure is implicit knowledge hidden in the layout. Second, many factors may affect the grouping structure, making it hard for researchers to enumerate all the relevant rules and principles. Several existing works have paid attention to recovering grouping from arbitrary images [5] and visualizations [13], but not from layouts of visual elements in information presentations. One of the most related works [14] implements four Gestalt laws to produce groupings in web page designs, which performs well on their test cases. However, extensive human effort is still required to determine better design rules to improve accuracy. Therefore, our emphasis in this paper is to build a more intelligent grouping technique with a learning-based approach.

The large collection of information presentations available online serves as a great resource for our research. Therefore, we first constructed a large-scale dataset of information presentations. Specifically, as PowerPoint (denoted as PPT) is one of the most commonly used formats for information presentations in daily life, we collected approximately 1000 PPT files from public websites, which finally gave us 23,072 well-designed information presentations. Each information presentation contains 27 visual elements on average. By parsing visual elements from these slides, we extracted typical types of elements, such as shape, text, picture, and chart. To assist in understanding the layout, we also parsed all spatial features of each visual element.

Next, we proposed an automatic approach to produce a hierarchical grouping of visual elements from information presentations. To avoid enumerating all possible factors, we first incorporated a learning-based approach to model the relatedness between visual elements in information presentations. Specifically, we introduced a Transformer-based model that captures the contextual layout information to predict relatedness between visual elements by learning from human-labeled information presentations. In addition, we improved the original Transformer-based model by integrating a spatial-aware attention mechanism to incorporate the relative spatial information among elements. Second, our technique hierarchically groups all the elements from bottom to top according to the predicted relatedness. In detail, we defined a graph model of the information presentation layout, in which nodes are visual elements and edges are the adjacent connections between elements. Then, we built a graph-based algorithm to group the visual elements hierarchically.

To evaluate the technique, we conducted a technical evaluation comparing the accuracy of our models with a baseline method. We found that our technique can correctly predict pairwise relatedness with high accuracy. Furthermore, we conducted a user study with thirty designers to evaluate the final hierarchical grouping results. The results show that our technique produces groupings of comparable quality to slides manually grouped by designers.

To summarize, we make the following three key contributions in this paper:

1) Dataset. We contribute a large-scale design dataset to the community. It includes 23,072 information presentations with 620,878 visual elements collected from the Internet. The dataset is freely available for academic purposes.

2) Technique. We present an automatic approach for recovering hierarchical grouping in information presentations. It consists of two major modules: a Transformer-based model for predicting relatedness between visual elements and a bottom-up algorithm for grouping the visual elements hierarchically.

3) Evaluation. We conduct a technical evaluation and a user study to evaluate our proposed technique. The results show that our technique can produce accurate hierarchical grouping results.

2 Background and related work

This section provides three categories of related work: graphic design layout, grouping of information, and data construction for graphic design.

2.1 Graphic design layout

To create effective and aesthetic information presentations, designers need to put significant effort into designing the layout [4]. Therefore, an extensive range of techniques has been proposed to assist layout design. Some studies in the community developed automatic layout algorithms that place graphical and textual elements in proper positions based on certain design rules [1]. For example, DesignScape [15] is an interactive system that aids the design process by interactively generating layout suggestions, such as position, size, and alignment. SketchPlore [16] uses a real-time layout optimizer that automatically infers the designer’s task to search for both local design improvements and global layout alternatives. As information presentation design also commonly uses grid layouts to define spatial structures with grid lines [17], the more recent GRIDS [18] uses a mixed-integer linear programming model for grid layout generation that optimizes the packing, alignment, grouping, and preferential positioning of elements. However, these rule-based methods typically aim to optimize certain objectives, which may ignore other aspects of the design. Recent advances in deep learning enable advanced techniques to generate graphic layouts by learning from existing works. Zheng et al. [19] proposed a deep generative model to synthesize layouts based on the semantic meaning of visual elements. Vinci [20] uses a sequence-to-sequence generative model to match visual elements and layouts for generating aesthetic advertising posters. LayoutGAN [21] synthesizes layouts from different types of 2D visual elements by using a generative adversarial network (GAN) architecture. While these techniques are powerful for generating information presentations, they largely skip the stage of precisely understanding existing layouts.

Other works in the community focus on understanding the layout in graphic design. For instance, Bylinskii et al. [22] presented a machine learning model that predicts the relative importance of different visual elements in graphic designs. Because inferring spatial relationships among elements plays a vital role in layout suggestion, Xu et al. [23] designed an interactive user interface for visualizing two types of spatial relationships: edge alignment and equal spacing. O’Donovan et al. [24] proposed an energy-based method to learn three hidden variables in single-page layouts: alignment, the importance of each element, and grid-based segmentation. These works are relevant to ours, as we also extract layout-related information from the design of information presentations. However, unlike these previous works, we focus on hierarchical grouping, the implicit knowledge hidden in the layout of visual elements. In this work, we take advantage of a deep-learning-based approach to reverse-engineer information presentations, especially to recover the hierarchical grouping from layouts.

2.2 Grouping of visual information

Although layouts vary across document types, such as presentation slides, web pages, and advertising posters, grouping information is a general way to organize the content of information presentations. A proper grouping can improve readability and highlight relatedness between pieces of information [25]. In practice, the proximity guideline proposed by [2] states “Group related items together”. Designers can place related visual elements close together so that they form a cohesive group rather than a set of unrelated items. In other words, spatial closeness usually implies high relatedness between the elements. In addition, the Gestalt principle states that human beings generally organize information using a top-down approach [25, 26]. This principle suggests that a designer should organize visual information in a clear hierarchical grouping structure.

Automatic grouping methods are extensively utilized in image processing. Ren et al. [27] used features derived from the classical Gestalt cues to group pixels into regions. Wattenberg et al. [13] utilized machine vision techniques for analyzing an image at multiple resolutions to describe the hierarchical segmentation of an image. Several bottom-up approaches were proposed for hierarchically grouping pixels in images or videos [28–30]. Recently, advanced deep learning methods have also been proposed to provide a range of end-to-end solutions for pixel-level grouping [31–33]. Nevertheless, these grouping techniques in image processing are all based on pixels. In contrast, grouping in information presentations focuses on the visual elements in a 2D view, which have more diverse spatial features.

There are several relevant grouping techniques for graphic design. Rosenholtz et al. [5] utilized Gestalt laws to create a simple model of perceptual grouping for arbitrary images. Zhen et al. [34] established a model to merge web page content into semantic blocks by simulating human perception. Yu et al. [35] proposed a process to analyze the structure of a web page and organize the web elements into a two-level hierarchy. However, these web-oriented grouping techniques commonly require a formal HTML structure. Recently in infographic design, InfoVIF [11] analyzed the presentation flow of static infographics and extracted the visual groups using the Gestalt principles of proximity and similarity. InfoMotion [6] extends InfoVIF to reconstruct visual groups using a clustering algorithm. However, these two recent techniques use carefully designed rules to assign elements to visual groups, which may ignore other design aspects. In addition, they both focus on flow structure extraction rather than the hierarchy in the grouping of information. One similar work introduced a computational method to group DOM elements in web pages [14]. Nevertheless, it relies on four manually defined objective functions that are hard to extend and improve. Another similar work by O’Donovan et al. [24] proposed a region-based hierarchical segmentation: the algorithm takes graphic layouts as input and then cuts each region into two sub-regions iteratively. This approach requires the design to follow a grid layout with non-overlapping visual elements. Compared to these works, our approach neither depends on manually defined rules nor imposes strict requirements on the layout.

2.3 Data collection for layout design

To support research on layout design in information presentations, a range of works construct datasets by collecting designs from public online websites. In the data visualization community, Chen et al. [36] built a dataset containing 360 images of multiple visualizations collected from academic publications and annotated their layouts. InfoVIF [11] provides a large dataset of 13,245 infographics in image format, created to explore the visual information structure in infographics. Deka et al. [37] presented Rico, the largest repository of mobile app designs, built by crowdsourcing to support data-driven applications such as user interface (UI) layout generation. Webzeitgeist [38] is another design repository of over 100,000 web pages with statistical descriptions of layouts. To support layout generation, Zheng et al. [19] collected a corpus of 3919 online magazine pages with rich annotations, including categories, semantic layouts, and textual summaries. Compared with these studies, we aim to construct a large-scale layout dataset that contains rich spatial features for understanding the layout structure. Therefore, we collected 23,072 well-designed information presentations from public websites and parsed all the related spatial features of each visual element.

3 Expert interviews

To help us understand the hierarchical grouping used in the design, we conducted a preliminary interview with ten graphic designers (eight females, aged 23-34). These designers have 6.3 years of design experience on average (from 2 to 16 years) and have professional training in design schools. They are all familiar with designing information presentations. During the interview, we guided them to explain their design experience with questions under two primary topics: benefits from hierarchical grouping and criteria for hierarchical grouping. We summarize their feedback as follows:

1) Benefits from hierarchical grouping

As most professional graphic design software (e.g., Adobe Illustrator and Sketch) supports hierarchical grouping, all the interviewees are familiar with this function. According to our interviewees, there are two reasons why they use hierarchical grouping in their information presentation files. The first major reason (mentioned by eight interviewees) is efficient support for layout design. For example, if a designer wants to align two groups of items, he or she can easily select one group at the proper level of the hierarchy and then align it to the other group. Otherwise, he or she has to operate on each item in the group, which significantly reduces efficiency. As another example, one expert in the interview said: “I often need to create web UI designs with various sizes to accommodate different screen sizes, such as mobile, tablet, and laptop. A proper hierarchical grouping can save me lots of time to revise the layout”. Second, four interviewees stated that hierarchical grouping can clearly segment a graphic design, assisting collaboration with other designers on the same document. On the other hand, all the designers agreed that a design without a proper hierarchical grouping may cause trouble. Specifically, one designer commented that “Without proper grouping, it is more possible to make mistakes when adjusting the layout and users have to undo the operation.” The designers also stated that when reusing design files in the wild, they have to manually adjust the hierarchical grouping of the visual elements before editing, which may burden their design process.

2) Criteria for hierarchical grouping

When discussing the best practices of grouping, designers mentioned several criteria. First, the items in a cohesive unit should have a high probability of being grouped together. In the example of Fig. 1(a), each item in the list, which consists of an icon, a title, and a description to convey a single message, should be grouped together because these elements form a highly cohesive unit. Second, designers agreed on the significance of spatial distance when grouping elements: related elements should be placed close together. Designers also admitted that, in most situations, they only considered grouping elements in the neighborhood. Finally, the grouping was suggested to follow a visual hierarchy, which plays a key role in planning the layout to help viewers navigate the information presentation more easily.

Design requirements. Our goal is to propose an automatic approach to recovering hierarchical grouping from information presentations. To achieve the goal, we summarize two design requirements based on the feedback gathered from the interviews with designers.

R1 Capturing relatedness. The first requirement is to capture relatedness between visual elements, which can be affected by many factors, including the types and spatial features. The technique should be intelligent enough to predict relatedness by considering as much information as possible.

R2 Constructing a hierarchical structure. The grouping result should be a hierarchical tree structure according to the visual hierarchy. A parent group can contain multiple sub-groups, but a sub-group cannot span multiple parent groups. In this way, any group can be easily accessed from its parent group.

To fulfill these requirements, the design of our technique consists of two major modules: (1) a deep learning model for predicting pairwise relatedness (Sect. 5) and (2) a graph-based algorithm for grouping (Sect. 6). In the first module, the deep learning model learns the relatedness between visual elements from labeled training data (R1). In the second module, we model the layout as a graph to represent the adjacent connections. Then, we propose a graph-based algorithm to group the visual elements hierarchically from bottom to top (R2).

4 Collection of information presentations

To understand information presentations, we construct a large-scale dataset of information presentations with diverse layouts from the Internet. First, we describe the procedure to collect the data. Then, we analyze the statistical results and extract the spatial features.

4.1 Collection procedure

We chose PowerPoint files as the target file format for constructing our dataset because they are among the most commonly used formats for information presentations. Each slide in a PPT file is an information presentation that consists of a set of graphical and textual elements. Furthermore, PPT files contain the actual spatial information for each visual element, which can be parsed with an open-source tool. Therefore, we can obtain accurate positions of visual elements from information presentations without resorting to computer-vision methods, unlike image-based approaches such as InfoVIF [11].

To ensure the design quality and diversity, we collected PPT files through two key steps: (1) searching a large number of candidates in the wild and (2) selecting proper slides to include in our dataset.

In the first step, we searched for public websites, such as SlideShare, that are used for sharing slides. We also searched related keywords such as “slide template”, “marketing slides” and “presentation design” in popular search engines to collect open-source PPT files. We included as many slides as possible in this step. At the end of this step, we had downloaded approximately 1000 PPTs in total, including more than 40,000 information presentations.

In the second step, we selected infographics-related information presentations from these candidates to construct our dataset. To ensure that the layouts of information presentations are diverse and qualified, we define the following four criteria to filter the slides.

C1 To ensure that the visual elements of an information presentation are worth grouping, we require slides to have more than three visual elements. The counterexample shown in Fig. 2(a) includes only an icon shape and two textual objects; so few elements are not worth grouping.

Figure 2

An illustration of four negative cases that we removed from the dataset: (a) includes only three visual elements (one icon and two textual objects); (b) has no graphical elements; (c) just contains one visual group; (d) has a disordered layout. (e) and (f) are two positive cases in the dataset

C2 To ensure the diversity of layouts, we require slides to contain at least two visual groups. In the counterexample in Fig. 2(c), which is a title page, the four elements (a rectangle, an icon shape, and two textual objects) form a single coherent group conveying one message.

C3 To ensure that an information presentation is visually attractive and informative, we require slides to include both graphical and textual elements. The counterexample in Fig. 2(b) presents a slide with only text information.

C4 The elements in the slide should be arranged in a well-organized structure; that is, the elements should form an intuitively regular pattern [11]. Fig. 2(d) shows a counterexample whose elements are disordered to some extent.

We wrote a simple program to filter out invalid slides (C1 and C3). As C2 and C4 involve subjective judgment, human effort was required to assess whether a slide is qualified. Two authors of this paper manually filtered these slides. Both of them are experts in designing information presentations: one studied in a design school for three years, and the other has worked on information presentations for over five years. Deciding whether to keep or remove a slide took an author approximately 5 seconds. They kept the cases they were unsure about and met every three days to discuss these cases and reach a consensus. The whole process took seven days (about six hours per day) to look through all the slides. Figure 2(e) and Fig. 2(f) are two typical information presentations included in our dataset. They have diverse visual elements and clear layout structures.
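The automatic part of this filtering (C1 and C3) is straightforward to express in code. The sketch below illustrates it in Python; the type-name strings are illustrative labels for the seven parsed element types, and the function name is our own:

GRAPHIC_TYPES = {"preset_geometry", "freeform", "picture", "line", "chart", "table"}

def passes_automatic_filters(element_types):
    """C1: more than three visual elements; C3: both graphical and textual ones.

    `element_types` is the list of type names of all elements on one slide.
    """
    has_graphic = any(t in GRAPHIC_TYPES for t in element_types)
    has_text = "text" in element_types
    return len(element_types) > 3 and has_graphic and has_text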

Figure 3

Summary statistics. (a) The distribution of element count per slide; (b) The distribution of types in the dataset; (c) Spatial distribution of the seven element types on the normalized space. The color of a region indicates the count of the element type appearing in the corresponding region

4.2 Results

We finally collected a total of 23,072 slides with 620,878 visual elements. The average number of elements per slide is 26.9. Figure 3(a) presents the distribution of the number of elements in our data collection. The majority of the slides (83.7%) contain fewer than 40 elements, and more than half of the slides (57.3%) have 10 to 30 visual elements.

We encode the layout structure of each slide with different types of visual elements. In total, we extracted seven major types from our collection: preset geometry, freeform shape, text, picture, line, chart, and table. The distribution of element types is displayed in Fig. 3(b). A preset geometry (216,300) is a typical shape provided by general design tools, such as a rectangle, a circle, or a star. The most commonly used preset geometry is the rectangle (119,818), followed by the oval (56,560). A freeform shape (192,673) is a user-defined shape composed of multiple paths. These elements are usually icons or embellishments, which may carry semantic meaning for the design. A text (185,259) is a text box that enables users to input text information; it is the most important component for conveying messages in information presentations. The remaining four types make up a small share of the visual elements: picture (15,887), line (6771), chart (3510), and table (251). Figure 3(c) illustrates the spatial distributions of these seven element types in the normalized space, demonstrating the diversity of our data collection in terms of layouts of visual elements. From these statistical results, we can see that the information presentations in our data collection have rich visual elements and diverse layout structures.

Then, we extracted related spatial features from each visual element. Specifically, we recorded the element type and the spatial features, covering the z-index, position, size, and rotation. For the textual elements, we also extracted the alignment of the text. The reason to pick these features is that all of them are highly relevant to the spatial structure. We describe these spatial features as follows:

1) Z-index indicates the order of elements along the z-axis. An element with a higher z-index is on top of an element with a lower z-index. We represent the z-index using a natural number.

2) Position indicates the position of the element. We use the bounding box \((x_{1}, y_{1}, x_{2}, y_{2} )\) to represent the position of the element, where \((x_{1}, y_{1} )\) denotes the top left of the bounding box and \((x_{2}, y_{2} )\) denotes the bottom right. We adopt normalized parameters by scaling \((x_{1}, y_{1}, x_{2}, y_{2} )\) to the range \([0, 1]\).

3) Size \((w, h )\) is the normalized width and height of a visual element. Although this feature can be computed from the position feature, we represent it explicitly to help the model capture this relationship more easily.

4) Rotation indicates the degrees of clockwise rotation of the element, ranging from 0 to 360.

5) Alignment is a text-specific feature, as text can be aligned in the vertical and horizontal directions. It includes two attributes: vertical alignment (top, middle, bottom, and mixed aligned) and horizontal alignment (left, right, center, and mixed aligned).

To protect data privacy, we removed all the content-related information, such as text and images. The final dataset only keeps the features of each visual element mentioned above, which is sufficient for studying the layout structure. We have open-sourced the dataset. More discussion about the usage of the dataset is provided in Sect. 8.2.

5 Prediction of pairwise relatedness

Before grouping visual elements in an information presentation, we first consider the relatedness between elements, which implies the probability of grouping two elements together (R1). This section introduces a deep learning model to this end. First, the basic model architecture and the working mechanism are presented. Next, we introduce an improvement to enhance the performance. Finally, the training process and model implementation are briefly described.

5.1 Model architecture

To predict the relatedness between any two visual elements in an information presentation, we propose a Transformer-based [39] model for pairwise prediction. There are two main reasons for this choice. First, as an information presentation can be formalized as a design sequence [20], it is reasonable to adopt a sequence model for our task. Compared with traditional sequence models such as recurrent neural networks, the self-attention mechanism [39] in Transformer models can capture long-range contextual information across the whole layout design. Specifically, treating the spatial information of all visual elements in the design as global context is helpful when predicting the relatedness between two visual elements. Second, language models built on the Transformer architecture have been widely adopted, so it would be easy to introduce pre-training for a promising improvement as the data collection grows.

The model architecture is shown in Fig. 4. It relies on supervised training based on human-labeled information presentations (Fig. 4(a)). The input of the model is a series of visual elements parsed from the information presentation (Fig. 4(b)), and the output is an association matrix of predicted relatedness between elements. The core model is based on the Transformer backbone, similar to recent work on learning layout representations [40]. It consists of three components: (1) an embedding layer to represent visual elements with heterogeneous features (Fig. 4(c)), (2) a Transformer encoder for contextualized representation (Fig. 4(d)), and (3) a prediction layer to predict pairwise relatedness (Fig. 4(e)). The model is trained to minimize the cross-entropy loss between the output and the ground truth (Fig. 4(f)).

Figure 4

The model architecture and mechanism for predicting relatedness between visual elements

5.2 Working mechanism

Pre-processing. Consider the example layout design in Fig. 4(a). The pre-processing module (Fig. 4(b)) first parses the design file to extract all visual elements in the design and then orders them into a sequence of visual elements according to the z-index. We represent the design sequence as \([x_{1}, x_{2}, \ldots, x_{n} ]\), where \(x_{i}\) indicates the i-th element in the sequence. According to the statistics in Sect. 4.2, we set the maximum sequence length to 128, as most of the slides in our dataset contain 128 elements or fewer. For efficient batch training, we pad the design sequence to this maximum length with special padding tokens.
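A minimal sketch of this pre-processing step, reusing the hypothetical VisualElement record from Sect. 4.2; the function name and the None padding sentinel are our own choices:

MAX_LEN = 128  # maximum design-sequence length used in the paper

def build_design_sequence(elements):
    """Order parsed elements by z-index and pad the sequence to MAX_LEN."""
    seq = sorted(elements, key=lambda e: e.z_index)[:MAX_LEN]
    return seq + [None] * (MAX_LEN - len(seq))  # None marks padding positions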

Embedding. To abstract visual elements into a data format understandable to the Transformer model, we convert each visual element \(x_{i}\) into a set of feature embedding vectors. For the numerical features, including the z-index, position, size, and rotation, we embed them using the positional encoding method [39]. For the two remaining categorical features, type and alignment, we embed them using a learnable embedding table. The final representation of a visual element is a single embedding vector \(e_{i}\) obtained by concatenating all the feature embedding vectors as follows:

$$ e_{i} = \Theta \bigl( e^{t}_{i} \oplus e^{z}_{i} \oplus e^{p}_{i} \oplus e^{s}_{i} \oplus e^{r}_{i} \oplus e^{a}_{i} \bigr), $$
(1)

where \(e^{t}_{i}\) to \(e^{a}_{i}\) denote the feature embedding vectors for type, z-index, position, size, rotation, and alignment, respectively. Θ is a non-linear transformation function. In this way, the output of the embedding layer is a sequence of visual element embeddings: \([e_{1}, e_{2}, \ldots, e_{n} ]\), which will serve as the input to the Transformer encoder.
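The sketch below illustrates Eq. (1) in PyTorch. The sinusoidal encoding of scalar features follows the positional-encoding recipe of [39]; the embedding dimensions, vocabulary sizes, and the ReLU layer standing in for Θ are our assumptions:

import torch
import torch.nn as nn

def sinusoidal(x, dim):
    """Encode a scalar feature tensor x (values roughly in [0, 1]) with sin/cos pairs."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    freq = torch.pow(10000.0, -2.0 * i / dim)        # frequencies as in [39]
    angles = x.unsqueeze(-1) * freq                  # (..., dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class ElementEmbedding(nn.Module):
    """Sketch of Eq. (1): concatenate per-feature embeddings, then apply Θ."""
    def __init__(self, d_feat=32, d_model=256, n_types=8, n_aligns=17):
        super().__init__()
        self.d_feat = d_feat
        self.type_emb = nn.Embedding(n_types, d_feat)    # categorical: element type
        self.align_emb = nn.Embedding(n_aligns, d_feat)  # categorical: text alignment
        n_numeric = 8  # z-index, x1, y1, x2, y2, w, h, rotation
        self.theta = nn.Sequential(                      # non-linear transformation Θ
            nn.Linear((2 + n_numeric) * d_feat, d_model), nn.ReLU())

    def forward(self, type_id, align_id, numeric):
        """type_id, align_id: (B, n) long; numeric: (B, n, 8) float."""
        parts = [self.type_emb(type_id), self.align_emb(align_id)]
        for k in range(numeric.shape[-1]):               # positional encoding per scalar
            parts.append(sinusoidal(numeric[..., k], self.d_feat))
        return self.theta(torch.cat(parts, dim=-1))      # e_i, shape (B, n, d_model)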

Encoding. The Transformer encoder (Fig. 4(d)) updates the element embeddings into contextualized vectors \([h_{1}, h_{2}, \ldots, h_{n}]\). Following the query-key-value attention mechanism in the Transformer, the model involves three learnable weight matrices: the query weight matrix \(\mathbf{W}_{Q}\), the key weight matrix \(\mathbf{W}_{K}\), and the value weight matrix \(\mathbf{W}_{V}\). The self-attention mechanism computes the attention score between the query of element embedding \(e_{i}\) and the key of element embedding \(e_{j}\) as follows:

$$ \alpha _{i j}=\frac{1}{\sqrt{d_{\mathrm {k}}}} (\mathbf{e}_{i} \mathbf{W}_{Q} ) (\mathbf{e}_{j} \mathbf{W}_{K} )^{ \top}, $$
(2)

where \(\sqrt{d_{\mathrm {k}}}\) is the scaling factor to stabilize gradients during training. The output of the Transformer encoder is the weighted average of the value vectors of each element embedding \({e}_{j}\) with normalized attention scores:

$$ \mathbf{h}_{i}=\sum_{j} \frac{\exp (\alpha _{i j} )}{\sum_{k} \exp (\alpha _{i k} )} \mathbf{e}_{j} \mathbf{W}_{V}. $$
(3)
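Eqs. (2) and (3) are standard scaled dot-product attention. As a minimal single-head sketch (the actual encoder stacks multi-head Transformer layers, which we omit here):

import torch

def self_attention(E, W_Q, W_K, W_V):
    """Contextualize element embeddings E (n, d) following Eqs. (2) and (3)."""
    d_k = W_K.shape[1]
    scores = (E @ W_Q) @ (E @ W_K).T / d_k ** 0.5   # alpha_ij, shape (n, n)
    weights = torch.softmax(scores, dim=-1)          # normalize attention over j
    return weights @ (E @ W_V)                       # h_i as a weighted sum of values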

Predicting. In the prediction layer (Fig. 4(e)), we predict the pairwise relatedness between every two elements. We use an association matrix R of size \((n \times n)\) to denote the prediction result. In the matrix, each item \(\mathbf{R} (x_{i}, x_{j} )\) represents the relatedness between elements \(x_{i}\) and \(x_{j}\), ranging from 0 to 1, where 1 means the model is confident that the two elements should be assigned to the same group and 0 means there is no possibility of grouping them. The predicted relatedness \(\mathbf{R} (x_{i}, x_{j} )\) is the average of the two related attention scores \(r_{ij}\) and \(r_{ji}\):

$$ r_{ij} = \frac{\exp (\mathbf{W}_{Q}^{r}\mathbf{h}_{i} \cdot \mathbf{W}_{M}^{r}\mathbf{h}_{j} )}{\sum_{k}\exp (\mathbf{W}_{Q}^{r}\mathbf{h}_{i} \cdot \mathbf{W}_{M}^{r}\mathbf{h}_{k} )}, $$
(4)

where \(\mathbf{W}_{Q}^{r}\) and \(\mathbf{W}_{M}^{r}\) are two learnable matrices for query and memory according to the query-memory attention mechanism [41]. The \(r_{ij}\) denotes the attention score between the query of \(\mathbf{h}_{i}\) and the memory of \(\mathbf{h}_{j}\); vice versa for \(r_{ji}\).
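A sketch of this prediction layer; averaging \(r_{ij}\) and \(r_{ji}\) makes the resulting association matrix symmetric. The matrix-shape conventions (rows of H multiplied by weight matrices) are our own:

import torch

def pairwise_relatedness(H, W_Qr, W_Mr):
    """Association matrix R from contextualized vectors H (n, d), Eq. (4)."""
    q = H @ W_Qr                          # query projections   (n, d')
    m = H @ W_Mr                          # memory projections  (n, d')
    r = torch.softmax(q @ m.T, dim=-1)    # r_ij, each row normalized over memories
    return 0.5 * (r + r.T)                # R(x_i, x_j) = (r_ij + r_ji) / 2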

5.3 Improvement

The spatial relations between elements, including the relative z-index and relative position in both the x-axis and y-axis, are informative for measuring the relatedness between elements. However, the basic attention mechanism in the current Transformer encoder only considers the absolute position of visual elements, while the relative information is not explicitly modeled.

To better utilize this information, we improve the model by incorporating the spatial-aware attention mechanism [42, 43]. Specifically, we compute a new attention score \(\alpha _{i j}^{\prime}\) by adding three learnable biases (relative position biases \(\mathbf{b}^{\mathrm{X}}\) and \(\mathbf{b}^{ \mathrm{Y}}\), relative z-index bias \(\mathbf{b}^{\mathrm{Z}}\)) to the original score as follows:

$$ \alpha _{i j}^{\prime}=\alpha _{i j} + \mathbf{b}^{ \mathrm{X}}+ \mathbf{b}^{\mathrm{Y}} + \mathbf{b}^{ \mathrm{Z}}. $$
(5)

We have empirically verified that adding this spatial-aware attention mechanism increases the accuracy of the model, so our technique uses it by default. The evaluation of the improvement is presented in Sect. 7.1.
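As a sketch of Eq. (5), following the relative-attention formulation of [42, 43], each bias is looked up from a learnable table indexed by a discretized relative offset between elements i and j. The bucket counts, clamping ranges, and table shapes below are our assumptions:

import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    """Adds relative-position and relative-z-index biases to raw attention scores."""
    def __init__(self, n_buckets=32, max_z=128):
        super().__init__()
        self.n_buckets, self.max_z = n_buckets, max_z
        self.bx = nn.Embedding(2 * n_buckets + 1, 1)  # b^X, indexed by x-offset bucket
        self.by = nn.Embedding(2 * n_buckets + 1, 1)  # b^Y, indexed by y-offset bucket
        self.bz = nn.Embedding(2 * max_z + 1, 1)      # b^Z, indexed by z-index gap

    def bucket(self, delta, limit):
        return delta.round().long().clamp(-limit, limit) + limit  # shift to >= 0

    def forward(self, scores, cx, cy, z):
        """scores: (n, n) raw alpha_ij; cx, cy: (n,) element centers in [0, 1]; z: (n,)."""
        dx = self.bucket((cx[:, None] - cx[None, :]) * self.n_buckets, self.n_buckets)
        dy = self.bucket((cy[:, None] - cy[None, :]) * self.n_buckets, self.n_buckets)
        dz = self.bucket((z[:, None] - z[None, :]).float(), self.max_z)
        return (scores + self.bx(dx).squeeze(-1)      # Eq. (5): alpha'_ij =
                + self.by(dy).squeeze(-1)             #   alpha_ij + b^X + b^Y + b^Z
                + self.bz(dz).squeeze(-1))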

5.4 Training

To train the prediction model, we first label the relatedness between elements as ground truth. The labeling process is simple: PowerPoint is used to group the related items (Fig. 4(a)), and each pair of elements is labeled with a binary value (one if they are in the same group, zero otherwise). It is worth noting that this labeling process only needs a single grouping hierarchy, so general users without a design background can easily label the slides based on their perception and cognition. In addition, we expanded the labeled training data by crowdsourcing, as the task is easy to perform. The training objective is to minimize the cross-entropy loss between the association matrix from the model and the ground-truth matrix (Fig. 4(f)).
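The objective can be sketched as a per-pair binary cross-entropy between the predicted association matrix and the labeled one; the padding-mask handling below is our assumption:

import torch
import torch.nn.functional as F

def grouping_loss(R_pred, R_true, mask):
    """Cross-entropy between predicted and labeled association matrices (Fig. 4(f)).

    R_pred: (n, n) relatedness in [0, 1]; R_true: (n, n) binary labels
    (1 if the pair shares a group); mask: (n,) 1 for real elements, 0 for padding.
    """
    pair_mask = mask[:, None] * mask[None, :]
    R_pred = R_pred.clamp(1e-6, 1 - 1e-6)  # guard against log(0)
    loss = F.binary_cross_entropy(R_pred, R_true, reduction="none")
    return (loss * pair_mask).sum() / pair_mask.sum()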

5.5 Implementation details

The model was implemented in PyTorch [44]. All the trainable parameters are randomly initialized and updated via the Adam optimizer [45] with a learning rate of 0.0001. The training process uses a 0.3 dropout rate to avoid over-fitting. The maximum length of the input design sequence is 128 elements, and the batch size is set to 16. The model was trained on an NVIDIA Tesla V100 (16 GB) GPU.

6 Grouping of visual elements

This section introduces a bottom-up approach to recover the hierarchical grouping from layouts of visual elements. We first define a graph model of the layout based on the connections of adjacent elements and then propose a graph-based algorithm to group the visual elements hierarchically (R2).

6.1 Graph modeling

According to the expert interviews, designers generally follow the principle of proximity to organize the layout and place related elements adjacently. Therefore, we only consider grouping elements in the neighborhood in our implementation, and we can explicitly form a graph model representing the layout using the spatial features of the visual elements (Fig. 5). Formally, we define the graph model G of the information presentation, where nodes are the visual elements \([x_{1}, x_{2}, \ldots, x_{n}]\) and edges are connections between neighboring elements; we denote the edge connecting elements \(x_{i}\) and \(x_{j}\) as \((x_{i}, x_{j})\). Each edge carries a proximity weight computed from the spatial distance between the two elements (denoted as \(d(x_{i},x_{j})\)). We use the largest distance among three perspectives: Euclidean distance, horizontal distance, and vertical distance [11].
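A sketch of this edge weight under one plausible reading of the three distances: horizontal and vertical distances as bounding-box gaps (zero when the boxes overlap on that axis) and the Euclidean distance between box centers. The paper does not pin these definitions down, so treat them as assumptions; the element fields follow the hypothetical VisualElement record from Sect. 4.2:

import math

def edge_weight(a, b):
    """Proximity weight d(x_i, x_j): the largest of the Euclidean,
    horizontal, and vertical distances (after [11])."""
    h_gap = max(a.x1 - b.x2, b.x1 - a.x2, 0.0)          # horizontal box gap
    v_gap = max(a.y1 - b.y2, b.y1 - a.y2, 0.0)          # vertical box gap
    ca = ((a.x1 + a.x2) / 2, (a.y1 + a.y2) / 2)         # center of a
    cb = ((b.x1 + b.x2) / 2, (b.y1 + b.y2) / 2)         # center of b
    euclid = math.hypot(ca[0] - cb[0], ca[1] - cb[1])
    return max(euclid, h_gap, v_gap)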

Figure 5

An illustration of the graph model of an information presentation, where nodes are visual elements (denoted as black circles) and edges are connections between two adjacent elements (denoted as blue lines)

6.2 Hierarchical grouping algorithm

Next, we use a bottom-up approach to group visual elements hierarchically. Our hierarchical grouping algorithm is inspired by previous graph-based algorithms for image segmentation [46, 47], whose objective is to group pixels or regions that have a similar visual appearance. Instead, our objective is to group visual elements by relatedness and spatial proximity. The basic idea of the algorithm is to iteratively merge small groups into larger ones by considering the relatedness and spatial distances between visual elements in different groups. The algorithm works as follows (Algorithm 1).

Algorithm 1

Hierarchical Grouping

The input of the algorithm is the graph model with the visual elements \([x_{1}, x_{2}, \ldots, x_{n}]\) as nodes and all connections as edges \([(x_{1}, x_{2}), \ldots, (x_{i}, x_{j}), \ldots]\). The outputs are the levels of hierarchy \(\mathcal{H} = [H_{1}, H_{2}, \ldots, H_{l}]\) from bottom to top, where \(H_{i}\) contains the groups at the i-th level of the hierarchy. We start with the initial grouping \(H_{0}\), where each visual element is assigned to an individual group (line 1). Next, we sort the edges by non-decreasing spatial distance (line 2) and initialize two parameters, τ and T. τ is the threshold parameter that controls the degree to which the distance between two groups must be greater than their internal distances in order for there to be evidence of a boundary between them [47]. T is the threshold parameter for the relatedness (line 3). These two parameters influence the granularity of the grouping: a larger τ and a smaller T lead to a higher level of the hierarchy with larger groups. After the initialization, the algorithm iteratively merges small groups into larger ones until all the elements are in one group (lines 4-18). In each iteration, we enumerate all edges in order (line 6). If the two elements of an edge (e.g., \(x_{i}\) and \(x_{j}\)) belong to two different groups (line 7), we consider whether to merge these two groups based on the following two conditions:

Relatedness. The first condition is whether the predicted relatedness \(R(x_{i}, x_{j})\) between \(x_{i}\) and \(x_{j}\) is larger than the threshold T (line 9). \(R(x_{i}, x_{j})\) can be read from the association matrix R produced by our model.

Spatial distance. The second condition is whether the spatial distance between \(x_{i}\) and \(x_{j}\) is smaller than the minimal internal distance of both groups (line 8). The minimal internal distance MID is defined as:

$$ \operatorname{MID} =\min \biggl(\operatorname{I} (\mathcal{G}_{a} )+ \frac{\tau}{\operatorname{A}(\mathcal{G}_{a})}, \operatorname{I} (\mathcal{G}_{b} )+ \frac{\tau}{\operatorname{A}(\mathcal{G}_{b})} \biggr), $$
(6)

where \(\operatorname{A}(\mathcal{G})\) computes the area size of the group, and \(\operatorname{I}(\mathcal{G})\) computes the largest distance in the minimum spanning tree MST of the group \(\mathcal{G}\):

$$ \operatorname{I}(\mathcal{G}) = \max_{(x_{i}, x_{j}) \in \operatorname{MST}(\mathcal{G})} d(x_{i}, x_{j}). $$
(7)

If both conditions are true (line 10), the two groups are merged into a larger group (line 11). After each iteration, if the number of groups decreases, the algorithm has constructed a higher-level grouping \(H_{q}\) from the previous one \(H_{q-1}\) (lines 15-17). We then update the parameters by decreasing T and increasing τ before starting the next iteration (line 18). The constant parameters α and β can be manually tuned to adapt to different types of information presentations; in our implementation for PowerPoint files, we set \(\alpha = 0.9\) and \(\beta = 1.1\). Finally, the algorithm outputs all levels of the hierarchy \(\mathcal{H}\) (line 19), which allows selection of the desired grouping at any granularity (R2).
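To make the procedure concrete, here is a condensed Python sketch of Algorithm 1 built on a union-find forest. It simplifies several details of the paper: the internal distance I(G) of Eq. (7) is approximated by tracking the largest merge distance instead of recomputing a minimum spanning tree, A(G) is approximated by summing member bounding-box areas, and a max_levels guard is added so the sketch always terminates. The starting values for τ and T are placeholders:

def hierarchical_grouping(elements, edges, R, tau=0.05, T=0.9,
                          alpha=0.9, beta=1.1, max_levels=50):
    """elements: records with a .size property (w, h); edges: (i, j, dist) tuples,
    e.g., from edge_weight above; R: association matrix from the prediction model."""
    n = len(elements)
    edges = sorted(edges, key=lambda e: e[2])              # line 2: sort by distance
    parent = list(range(n))                                # union-find forest
    internal = [0.0] * n                                   # stand-in for I(G), Eq. (7)
    area = [max(w * h, 1e-6) for (w, h) in (e.size for e in elements)]

    def find(v):                                           # root (group id) of node v
        while parent[v] != v:
            parent[v] = parent[parent[v]]                  # path compression
            v = parent[v]
        return v

    hierarchy = []
    for _ in range(max_levels):
        n_groups = len({find(v) for v in range(n)})
        for i, j, dist in edges:                           # line 6: enumerate by order
            ri, rj = find(i), find(j)
            if ri == rj:
                continue                                   # already in the same group
            mid = min(internal[ri] + tau / area[ri],       # Eq. (6): minimal
                      internal[rj] + tau / area[rj])       # internal distance
            if R[i][j] > T and dist <= mid:                # relatedness + spatial tests
                parent[rj] = ri                            # merge the two groups
                internal[ri] = max(internal[ri], internal[rj], dist)
                area[ri] += area[rj]
        roots = {find(v) for v in range(n)}
        if len(roots) < n_groups:                          # a coarser level was formed
            groups = {}
            for v in range(n):
                groups.setdefault(find(v), []).append(v)
            hierarchy.append(list(groups.values()))        # record this hierarchy level
        if len(roots) == 1:                                # all elements in one group
            break
        T, tau = alpha * T, beta * tau                     # line 18: relax thresholds
    return hierarchy                                       # H, from fine to coarse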

7 Evaluation

In this section, we evaluate our technique, applied to hierarchical grouping of information presentations, through a technical experiment and a user study.

7.1 Technical experiment

We first present quantitative assessments of the prediction model. For the experiment, we manually labeled 1008 slides from our data collection. Although this is only a small portion of the data collection, it yields enough pairs of visual elements for training. The labeled slides include 27,151 visual elements, which comprise 519,415 element pairs in total. We split the labeled data 80/20 into training and testing sets.

Given a pair of visual elements in an information presentation, our approach predicts their relatedness. If the relatedness score is higher than a threshold t (\(t=0.5\)), our approach predicts that they are probably in the same group. We use \(accuracy\) as the evaluation metric because we have the human-labeled ground truth. Formally, \(accuracy\) equals \(M/N\), where M is the number of correctly predicted pairs and N is the total number of pairs in the testing data.
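The metric itself is trivial to compute; a one-function sketch, where each labeled pair is a (predicted score, binary label) tuple:

def pairwise_accuracy(pairs, t=0.5):
    """accuracy = M / N over (predicted_score, binary_label) pairs."""
    correct = sum(int(score > t) == label for score, label in pairs)
    return correct / len(pairs)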

We compared the performances of the following three machine learning models on the testing dataset. Note that we did not include other layout-based machine learning models here because most such works focus on layout generation [21, 48], which does not apply to the grouping task.

1) Neural Network Baseline. We implemented a three-layer neural network as the baseline, which uses the same input feature embedding vectors of the visual elements and outputs the results of pairwise classification.

2) Our Transformer Model without Spatial-aware Attention. We implemented a basic version of our model without the spatial-aware attention mechanism. It shows the power of the original Transformer architecture.

3) Our Transformer Model with Spatial-aware Attention. We implemented the final version of our model by adding the spatial-aware attention mechanism, which leverages relative position information.

We summarize the performances of these three models in Table 1. The neural network baseline achieves 77.49% accuracy, while our approaches outperform it significantly by 8.23% and 8.83%, respectively. This result is expected because our models leverage the Transformer architecture to capture contextual information from the whole layout structure. Comparing our model with and without the spatial-aware attention mechanism, the accuracy of the final model (86.32%) is slightly higher than that without spatial-aware attention (85.72%), indicating that relative position information can improve relatedness prediction in the 2D layout to some extent. The experiment shows the ability of our prediction model to capture the relatedness between visual elements in information presentations.

Table 1 Accuracy of the models for predicting relatedness between visual elements

7.2 User study

We conducted a user study to evaluate the quality of the generated hierarchies. Although there are several grouping techniques for graphic design [6, 11], they focus on structure extraction rather than hierarchical grouping. Other related hierarchical grouping techniques [14, 24] require the information presentation design to follow specific rules, such as using a grid layout or having no overlaps among visual elements, which limits their applicability. Therefore, we directly compared our results with a baseline from human designers. The experiment consisted of two major phases: (1) ten designers manually grouped visual elements hierarchically as the baseline, and (2) another 20 designers rated the results from the baseline and our approach. We hypothesized that our technique could produce results comparable to those of human designers.

Study Data. To prepare the study data, we first randomly sampled 50 slides from our dataset. We invited ten designers from our preliminary interview to manually group the visual elements in these slides hierarchically. Before they started, we briefly introduced the grouping task and the basic usage of PowerPoint. During the task, each designer was given a PowerPoint file with ten different slides and was asked to hierarchically group the elements in each slide. We recorded the completion time for each designer. Finally, each designer completed ten hierarchical grouping results, giving us a total of 100 labeled slides (each slide was labeled by two randomly assigned designers). An interesting finding was that designers may have different grouping solutions for the same information presentation. We also used our technique to automatically recover the hierarchical grouping from these slides. To compare the automatic results with the manually labeled results, we constructed 100 comparison pairs, each including one result from a designer and one result from our technique for the same slide. For a fair comparison, we ensured that the generated grouping result had the same number of hierarchy levels by deleting excess levels.

Participants. We recruited 20 participants (14 females, aged from 22 to 31) from a design school. All the participants were familiar with design software such as Adobe Illustrator and Microsoft PowerPoint. They all confirmed having experience in making information presentations using Microsoft PowerPoint.

Task and Procedure. The experiment began with a brief introduction. Each participant was asked to complete 25 comparison tasks. In each comparison task, the participants were given two hierarchical grouping results and were asked to choose among three options (“A is better than B”, “B is better than A”, or “A and B are about the same”). This setting made it fairer and easier for participants to judge the relative quality of two grouping results than scoring each result individually. The order of the grouping results within each comparison was randomized. Figure 6 illustrates a snapshot of a side-by-side comparison used in one comparison task, where A is manually grouped by a designer and B is automatically recovered by our technique. In total, we collected 25 comparisons × 20 participants = 500 responses. It took approximately 30 minutes on average (ranging from 25 to 40) for each participant to complete all tasks. After they completed the tasks, we asked them to comment on their preferences.

Figure 6

A snapshot of a side-by-side comparison pair used in the evaluation. (a) is manually grouped by a designer, and (b) is automatically recovered from our technique

Quantitative Results. We analyzed the comparison results to investigate the preferences in terms of grouping quality. In general, the preferences for both sides are balanced, as demonstrated in Fig. 7(a). Specifically, 39% of the responses (195/500) indicated that the grouping results from our technique are better, and 42.4% of the responses (212/500) indicated a preference for the results grouped by designers. The rest of the responses (18.6%, 93/500) indicated that both results are approximately the same. We further investigated the preferences of each participant. The mean preference per participant and the 95% confidence intervals indicate that the preference for the designers’ grouping results is only slightly higher than for our technique’s (Fig. 7(b)). According to a Student’s t-test (\(t(19)=0.45\), \(p=0.66\)), there was no significant difference between the preferences for the two sides. On the other hand, our technique saves the extra time of manual grouping, which took designers about 30 minutes (\(M=30.1\), \(SD=6.69\)) to group the elements in a ten-slide file in our user study. As a result, our technique can significantly improve efficiency while producing hierarchical grouping results comparable to human work.

Figure 7

Comparison results. (a) The chart presents the distribution of responses in the comparison experiment; (b) The chart shows mean preference per participant and 95% confidence intervals

Qualitative Results. During the study, participants were not informed that one result in each testing pair was generated by an algorithm. After the study, we asked their opinions about the two results. All participants commented that they did not notice that part of the grouping was completed by a machine. In the participants’ opinion, most of the grouping results were about the same, and they might select one of them according to their personal preferences. For example, P4 commented: “Generally, there are no big differences between the two sides. However, when I look at a graphic design, I will think about how to refine the layout. Therefore I would love to choose the proper grouping that may make my design process more convenient.” P12 said: “Most differences are very slight for me to give a choice. I think it is reasonable because designers may have different opinions when grouping the elements”. After we told them that one side was from an automatic approach, we asked them to comment on the effectiveness and usefulness of the technique. Eighteen participants appreciated that the technique might significantly improve design efficiency. Furthermore, some of them offered suggestions for future work. For example, P9 commented that it would be more convenient if each group could automatically be given a proper group name. P7 said: “I hope the technique can give personalized suggestions by learning from my grouping behaviors. For example, I can demonstrate operations in one design file. Then, I want all my downloaded public design resources to be processed as I demonstrated.” Although most participants gave positive feedback, two participants worried that the automatic grouping of elements might limit the creativity of the design to some extent. One of them (P10) said that the grouping structure may affect the designer’s strategy in downstream layout design tasks.

7.3 Sample grouping results

The gallery in Fig. 8 demonstrates eight hierarchical grouping results recovered by the automatic approach. (a) and (b) have three hierarchies, and the others (c)-(h) have two hierarchies. Our automatic approach has the ability to group visual elements properly at each level of granularity.

Figure 8

Eight examples of hierarchical groupings recovered by our technique. (a) and (b) have three hierarchies, and the others (c)-(h) have two hierarchies

8 Discussion

8.1 Failure cases and limitations

We observed some failure cases during the evaluation and summarize two main causes. One typical failure is that the current approach cannot understand the semantic meaning of a user-defined shape. For example, Fig. 9(a) shows a user-defined shape on the left that is composed of multiple freeform shapes. Our technique favors grouping them all into a single group at each hierarchical level. However, the designers in our user study preferred to split it into two parts, a “computer” and a “phone”, at the lowest level of the grouping hierarchy. Although this is easy for human beings, our current technique cannot understand the semantic meaning of these shape compositions. Another primary cause of failure is that complex connectors may affect the grouping accuracy. Simple connectors are lines or arrows that connect visual elements and express their relations, but a complex connector might be composed of multiple elements. In the example of Fig. 9(b), the connectors among the three head portraits are a composition of a dashed line (line), a circle (preset shape), and an icon (freeform shape). Designers can recognize these connectors and group the head portraits with the corresponding text below; however, our current technique mistakenly groups all the above visual elements as a whole. We believe that feeding in more training data can mitigate this problem if the model can see more semantic groups.

Our technique also has some limitations in terms of research scope. The first limitation is that our current approach does not involve the content of each visual element, covering text, picture, and shape. Understanding the semantic meaning of the content may assist in understanding the layout structure, which is a promising direction for future work. In addition, we could involve more semantic features, such as role labels (e.g., title, subtitle, and body). The second limitation is related to user preferences. As we found in the user study, there are multiple ways to group visual elements; even two professional designers may have different strategies for the element grouping task. Our technique can provide grouping results at different granularities but cannot adjust its strategy according to user preference. As a participant in the user study mentioned, a promising area for future research might be learning from the user’s design behaviors. Third, this paper does not attempt to build a complete perceptual model of visual grouping. Instead, we propose a learning-based method for hierarchical grouping as a design aid within the domain of information presentations. To understand how designers group visual elements, more theoretical research on visual perception is needed.

Figure 9

Failure cases: (a) our technique failed to segment two semantic groups of freeform shapes; (b) it was affected by connectors

8.2 Usage scenarios

1) Usage of the dataset

We see our dataset of information presentations as a useful starting point for exploring layout structures in graphic design. In addition to our grouping research, this dataset is also helpful for researchers in other tasks. For instance, it can help researchers understand the design demographics of information presentations as the collected data are created by various designers. In addition, it can enable a series of studies for design-based machine learning, such as layout clustering, layout classification, layout recommendation, and layout generation. Moreover, this dataset can also serve as a rich design resource to inspire designers when designing layouts.

2) Usage of the technique

The technique of automatic hierarchical grouping can be applied to design resources in the wild. Although professional graphic design tools (e.g., PowerPoint and Adobe Illustrator) provide a grouping function for users to organize the grouping structure, the grouping information in existing information presentations can be very noisy, as designers might only care about the final presentation during the design process. Unintentional actions or mistakes may also affect the grouping results. Our technique can fix this problem by producing hierarchical groupings for a large batch of information presentations in a short time.

9 Conclusion

In this paper, we present an automatic approach for reverse-engineering information presentations. We first collected a large-scale dataset with 23,072 well-designed slides as our research resource. Based on the dataset, we propose a technique to recover hierarchical grouping from layouts of visual elements. Our technique incorporates an advanced deep learning model to understand the relatedness between visual elements, together with a bottom-up graph-based algorithm that produces the hierarchical grouping by considering the relatedness and spatial distances. A technical experiment and a user study evaluated the technique, demonstrating ability comparable to that of human designers. We believe that future applications can benefit from our contributed dataset and technique.