Optimal pagination and content mapping for customized magazines
Traditional media such as magazines and newspapers are undergoing deep transformations as they cope with the high volume and dynamicity of currently available information. In addition, with the emergence of decentralized publishing models, there is an increasing need for automated tools for authoring high-quality documents. Moreover, much of the dynamic information on the Web could also profit from such mechanisms for automatic presentation and summarization.
This paper describes a solution to the problem of automatically producing a camera-ready magazine from a set of page templates and a sequence of variable content to be placed on those templates. The algorithm is able to find the optimal number of pages to hold the content, selecting the best templates to be used in the magazine in such a way that all pages are optimally used.
The algorithm was integrated into Adobe’s InDesign® software, extending it to perform text fitting and rendering of magazine pages. The complete workflow is described in this paper, along with an empirical evaluation and a discussion of future research directions.
Keywords: Variable data printing, Automatic document layout, Mass customization, Customized magazines
Information is currently produced, consumed and delivered in ways that directly affect traditional publishing businesses such as magazine publishers, where information (e.g. news stories, etc.) is usually produced by journalists and placed on pages designed to hold information with sizes known in advance. Such a scenario is changing as centralized publishing models are unable to cope with the high volume and dynamicity of information available from a myriad of sources. Therefore, a new model for publishing is required, combining graphical qualities and dynamic content obtained from the web.
Another noticeable trend is the appearance of decentralized self-publishing businesses, where any individual is able to create and distribute their own publications. An example of this trend is HP’s MagCloud® service: clients are able to send their own PDF files and the service takes care of displaying, charging, printing and distributing.
However, the task of assembling graphics and content to produce one’s own magazine usually requires graphic design skills and technical knowledge of a desktop publishing tool such as Adobe InDesign® or QuarkXPress® before a PDF file is produced. In this case, an automated system for assembling arbitrary content into camera-ready publications is still a desirable technology.
Systems for assembling documents are not new. Variable Data Printing (VDP) systems evolved from earlier transactional businesses such as direct mail marketing, bank statements, bills and others, where static content is mixed with custom data (i.e. relevant client information). The goal was to assemble large numbers of document instances that are unique and targeted at single individuals. However, most VDP-related technologies, such as the PPML language, can only handle content that is more or less predictable, and a graphic artist must prepare for little document variability in order to produce a document template that serves as a base for the customized instances.
1.1 Objectives and contributions of this paper
This work describes a method for the automatic construction of magazines where an arbitrary sequence of content such as text and pictures is optimally mapped to designer-generated templates, ensuring the best possible presentation of the content. In addition, users may require the content to fill a specified number of pages, or otherwise let the system automatically determine an optimal number for that content. Thus, this problem can be reduced to mapping content into one or more pages and selecting appropriate templates for each page, maximizing some quality measure. This is known in the literature as the pagination problem.
The proposed algorithm receives a content sequence and produces an optimal sequence of template selections matching the content. After producing the optimal content mapping and pagination, Adobe’s InDesign® software is used as a rendering engine. An extension to InDesign imports the layouts and performs typographic corrections such as text fitting on templates’ text placeholders.
The main contributions of this work are:
- A simple and extensible representation for different types of content;
- A complementary solution to the recent work of Giannetti, handling both pagination and text flow issues;
- The ability to fit the content to a specific number of pages provided by the user, or to select an optimal number automatically;
- Unlike other pagination methods, the proposed pagination algorithm always fills every page completely, including the last page;
- A user can easily change content or templates and quickly produce a new document for proofreading.
This paper is organized as follows. Section 2 describes related works. Next, in Sect. 3 the problem is described and the requirements for an optimal pagination are defined. The workflow for the solution is described in the same section, and document and template representations are detailed in Sect. 4. The mapping and pagination algorithms are described in Sect. 5 and the generation of camera-ready documents is described in Sect. 6. In Sect. 7 results and performance are presented. Section 8 describes additional techniques for enhancing results through the parameters used by the algorithm. Finally, Sect. 9 presents conclusions and a discussion of possible improvements and future directions of this work.
2 Related works
It is usually hard to produce an aesthetically pleasing layout starting from unknown, variable content. To ensure aesthetic quality and reduce the search space, current approaches may constrain the problem to less flexible layouts and well-behaved content. Newspapers are an example where the layout is hierarchical by nature and news items are similar in appearance and length, which makes the construction of deterministic algorithms for this task easier.
It is worth noting that, given the complexity of the layout problem, several non-deterministic approaches for automatic document layout have been proposed [11, 15, 21]. These approaches attempt to maximize an objective function that encodes one or more aesthetic qualities, and rely on randomized optimization heuristics, such as simulated annealing or genetic algorithms, to find the best layout. However, these approaches are known to require long execution times in order to produce documents of acceptable quality, and usually converge poorly to optimal solutions when given multiple (often conflicting) aesthetic criteria [12, 21]. Because their results are predictable, deterministic approaches are thus more desirable in a document production setting, where documents cannot be manually inspected for aesthetic flaws.
For most high-end publications such as magazines, a graphic designer produces page templates to be used in the publication, conveying the look-and-feel and appeal of that particular magazine. Approaches for the selection of templates based on grids have been investigated by Jacobs et al. [14, 22], describing a document authoring system using constraint-based templates to have multiple geometric representations for the same template depending on its content. Their aim was to enable a designer to provide multiple representations of a document using templates that would be adaptable for different media sizes and capabilities.
automatic picture cropping;
low-level formatting (e.g. hyphenation, line-breaking, justification, kerning, etc.).
As in our work, they also use an optimization method for performing pagination, based on Plass’s method, although admittedly a very complex adaptation that is tightly coupled with their rendering engine.
This coupled approach, however, leads to several shortcomings in our intended scenario. Their algorithm requires an intensive back-and-forth processing between the pagination algorithm, the rendering engine (for scoring the document formatting), and the constraint solver for the adaptive templates as well. Although the authors did not provide a more detailed performance analysis, we believe that the efficiency of such approach would not scale well for larger documents or mass production. The authors also point out that the adaptive template language is very hard to use when defining constraints between content placeholders.
Our approach, on the other hand, explores the separation between pagination and rendering, as this results in a more efficient workflow, and is also more adaptable to different workflows for custom documents. For instance, one may need different rules for content selection, or use different rendering engines for different media. Although our approach relies on static templates, the simpler template language makes authoring significantly easier.
More recently, Giannetti described a model for linking elements from one or more streams of content over a set of page templates. The sequence of page templates used in the final document is determined by simple rules defined by the author, such as the repetition of a page template in even or odd pages, or only in the first page. Our approach uses the same streams of content, as for example there could be a stream of texts and pictures and another stream of advertisements to be placed across the publication, keeping balance between the amounts of each content. On the other hand, the choice of template is made optimally and not constrained by previously defined rules.
3 Problem statement and workflow
The goal of this work is to allow an author to produce a high-quality publication made of several pages based solely on its content and a set of templates for those pages. Therefore, the problem consists of splitting an input sequence of contents (e.g. headlines, texts, pictures, etc.) into a number of pages (possibly specified by the user). A set of page templates is provided, and for each page on the final document an appropriate template must be chosen to hold its content. A template is informally defined as a hand-made page design that holds one or more geometric placeholders, each representing a specific type of content, such as texts, pictures, advertisements, and others. Therefore, a pagination algorithm should include a mapping procedure that knows how to place content into a template.
Given a desired number N of pages, it is necessary to split the content into N parts such that for each part there exists a template able to hold it. Moreover, a sequence of content may fit in a template but produce wasted page space or the over/under filling of content inside a placeholder, causing an uneven distribution over a page. This accounts for the need for some placement error of a sequence of content into a template. Therefore, the splitting must be selected so that the placement errors are minimized (more details in Sect. 5.1).
The templates and their placeholders are rigid, as changes in their geometry are forbidden;
The input order for textual content should be preserved;
All pages must be filled entirely with content, including the last page;
The selection of templates must be globally optimal according to some quality measure. For instance, all text placeholders must have a similar text density and the wasted space around pictures (due to differences in aspect ratio between pictures and placeholders) must be minimized.
As the content sequence will be split, text is modeled as a sequence of blocks (whole paragraphs, sentences, words, or even characters) that are discrete and indivisible objects. Therefore, text flows are handled after choosing an appropriate partitioning of content. This accounts for a granularity of content, where a finer division of text blocks will result in a better document, at the cost of run time performance. Such division of text objects is assumed to be performed prior to the pagination algorithm and represented in the input content accordingly. This will be discussed in more detail in the following sections, especially in Sect. 8.1.
Files describing the input content and the templates are fed into a layout engine, which breaks down content into pages and selects appropriate templates for each page as described in Sect. 5. Both input and output of the workflow are made through XML files described in the next section.
4 Document format
XML files are used for the pagination and could be extended to a more complete document description language, such as PPML or the higher-level Document Description Framework (DDF). Three XML formats are used for the different elements in our approach: content sequences, templates and the resulting mapping of content of a sequence into page templates. These will be briefly described in the following sections.
4.1 Content sequences
4.2 Page templates
Adobe InDesign® was used to perform the rendering of documents, but it was also used for the authoring of templates using the format from Fig. 3.
4.4 Introducing new objects
As templates handle the same types of object as contained in the input sequence and a placement error can be measured for each type (see Sect. 5.1), it is possible to extend the model and introduce new types of content as necessary. For instance, publications may define an advertisement type and some of the available templates will be able to hold such objects. As another possibility, using the picture type could be avoided altogether by introducing two new types smallpic and bigpic to be used only in specific places on the templates, gaining more control of the final look of the publication.
Composite elements can also be used to prevent the pagination algorithm from splitting related objects across two different pages. For instance, an article object type may be composed of a headline, two columns of text and a picture for that article. The entire article will appear as a single object to the pagination algorithm, and as a consequence its contents will appear together in the final document.
5 The pagination and mapping algorithms
This recurrence can be implemented as a dynamic programming algorithm that fills a table $P_{n,p}$ holding information to optimally divide a sequence of n elements into p page templates.
In (1), σ(i,j) is an error function that returns the minimal error obtained by the templates from the set T of templates when holding sub-sequence [i,…,j] of contents. The σ function must also keep information about which template is the best choice for the given interval, for further use. The error function is given in Sect. 5.1.
When n=1, only the first element of the sequence remains, so it has to be placed in a single page template, which is selected by the σ function with σ(1,1);
When p=1, all elements of the sequence 1,…,n must be placed in a single page template, selected by σ(1,n);
Otherwise, an optimal page break must be found, so the recurrence attempts to split the content sequence into two groups at every possible break point in the sequence 1,…,n−1, and solves the problem recursively on the left side of each break point. By minimizing the largest error σ between the left and right sides of the sequence, it is possible to find a sequence of page breaks that minimizes the global error from σ for every page 1,…,p.
During template evaluation, it may happen that some element in the sequence does not fit in the candidate template, or some placeholder from that template may be left empty. In that case, this template must be rejected, by attributing to it an infinite error (σ(i,j)=∞). The pagination algorithm will then select different sequences of elements for mapping. If no valid sequences can be found, the algorithm will fail to provide a solution. Therefore, a different or larger set of templates must be provided to handle the same input sequence. If input order is not important, reordering the sequence can also be attempted to yield a valid solution.
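The recurrence and the rejection rule described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation (which, per Sect. 7.2, was written in Java): the names `paginate` and `demo_sigma`, and the toy error function based on a single page capacity, are assumptions made only for this example. Because the sketch follows the base case for n=1 literally, a solution may use fewer than p pages when the left side runs out of elements.

```python
import math
from functools import lru_cache


def paginate(n, pages, sigma):
    """Split elements 1..n over `pages` pages, minimizing the largest page error.

    sigma(i, j) is assumed to return the error of the best template for the
    sub-sequence i..j (1-based, inclusive), or math.inf if no template fits.
    Returns (error, breaks), where breaks holds the last element of each page.
    """

    @lru_cache(maxsize=None)
    def best(m, p):
        if m == 1:  # only the first element remains
            return sigma(1, 1), (1,)
        if p == 1:  # everything must go on a single page
            return sigma(1, m), (m,)
        result = (math.inf, ())
        for i in range(1, m):  # try every break point 1..m-1
            left_err, left_breaks = best(i, p - 1)
            err = max(left_err, sigma(i + 1, m))
            if err < result[0]:
                result = (err, left_breaks + (m,))
        return result

    return best(n, pages)


# Hypothetical error function: every page holds at most CAPACITY units and
# the error is the unused fraction of the page (infinite on overflow).
SIZES = [3, 4, 2, 5, 1, 3]
CAPACITY = 8


def demo_sigma(i, j):
    used = sum(SIZES[i - 1:j])
    return math.inf if used > CAPACITY else (CAPACITY - used) / CAPACITY


error, breaks = paginate(len(SIZES), 3, demo_sigma)
print(error, breaks)
```

An infinite result signals that no valid pagination exists for the given templates, which corresponds to the failure case described in the text.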
5.1 Template scoring and mapping
The actual measurement of the placement error is performed by allocate(t,i,j), which receives a template t from T and a sub-sequence [i,…,j]. This procedure is described below.
5.2 Allocating elements to placeholders
Once an interval [i,…,j] of elements has been selected for placement, it is necessary to map this interval to the candidate template and measure the placement error. Templates that cannot match the sequence are discarded before attempting any mapping of content.
To map pictures, we try every ordering of pictures on the page to minimize the space wasted by aspect ratio differences between the pictures and their placeholders, since no automatic cropping or non-proportional scaling is performed. However, a picture is only moved out of order if it is unrelated to the other contents in the input sequence. The coupling between pictures and other pieces of content can be specified in the input content by the use of composite types, described in Sect. 4.4.
To calculate the final error for the template, the maximum placement error is used among all placeholders of the template. The pagination algorithm then uses this error to find the solution that is the global minimum, effectively minimizing the maximum placement error.
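As an illustration of the picture mapping step, the sketch below tries every ordering of pictures over a template's picture placeholders and keeps the ordering whose worst placeholder error is smallest. The error measure used here (the fraction of the placeholder area left empty after proportional scaling) is an assumption for the example, since the exact scoring function is not reproduced in this text; the function names are hypothetical, and the rule that only unrelated pictures may be reordered is not modeled.

```python
import math
from itertools import permutations


def wasted_fraction(pic_ratio, box_w, box_h):
    """Fraction of a placeholder left empty by a picture scaled to fit.

    pic_ratio is width / height; no cropping or non-proportional scaling.
    """
    if pic_ratio >= box_w / box_h:
        used = box_w * (box_w / pic_ratio)  # width-limited
    else:
        used = (box_h * pic_ratio) * box_h  # height-limited
    return 1.0 - used / (box_w * box_h)


def allocate_pictures(pic_ratios, boxes):
    """Return (error, order): the minimum over orderings of the maximum waste.

    order[b] is the index of the picture assigned to placeholder b.
    A template with a different number of picture placeholders is rejected.
    """
    if len(pic_ratios) != len(boxes):
        return math.inf, None
    best = (math.inf, None)
    for order in permutations(range(len(pic_ratios))):
        err = max(
            (wasted_fraction(pic_ratios[k], *boxes[b]) for b, k in enumerate(order)),
            default=0.0,
        )
        if err < best[0]:
            best = (err, order)
    return best


# A landscape and a portrait picture, with one portrait and one landscape box.
error, order = allocate_pictures([1.5, 0.5], [(100, 200), (300, 200)])
print(error, order)
```

The exhaustive search over orderings is what produces the factorial worst case for picture-heavy pages; it stays cheap in practice when templates hold only a few picture placeholders.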
6 Rendering with InDesign
The mapping XML file (Sect. 4) generated by the pagination algorithm is read by InDesign;
Each page is assembled with its placeholders and their corresponding content from the content sequences;
Each type of element (i.e. text, headlines, pictures, etc.) is rendered according to specific rules, as described in the following sections.
Additionally, for composite elements (a captioned picture for instance, see Sect. 4.4), the rendering is performed recursively inside the element’s corresponding placeholder on the page. The following sections describe how each type of element as described in the document model is rendered.
6.1 Rendering headlines
Headlines are single lines of text that must fit inside their associated placeholder in the page template. It is assumed that the placeholder size is the intended size for its content as well; therefore, typographic adjustments are made by changing the font size so that the headline fits its placeholder entirely.
6.2 Rendering texts
The text placeholders are sequentially linked in each page template and across the pages. When the document contains a single sequence of content, a simple heuristic is to consider each headline occurring in the sequence as the beginning of a new “article”. Thus, every text placeholder is connected sequentially (as given by the input order) until a new headline appears. Then a new flow begins and the process is repeated. If the content is already split into several sequences, another approach is to simply associate every text placeholder to each sequence.
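The headline heuristic for building flows from a single content sequence can be sketched as below. The `(kind, payload)` tuple representation and the function name are assumptions for illustration only; they do not reflect the XML format or the InDesign extension used in the paper.

```python
def split_into_flows(sequence):
    """Group text elements into flows, starting a new flow at each headline.

    sequence is a list of (kind, payload) pairs in input order, where kind is
    "headline", "text", or anything else (ignored here, e.g. pictures).
    """
    flows = []
    current = None
    for kind, payload in sequence:
        if kind == "headline":
            # A headline marks the beginning of a new "article".
            current = {"headline": payload, "texts": []}
            flows.append(current)
        elif kind == "text":
            if current is None:
                # Text that appears before any headline gets its own flow.
                current = {"headline": None, "texts": []}
                flows.append(current)
            current["texts"].append(payload)
    return flows


content = [
    ("headline", "A"), ("text", "a1"), ("picture", "p1"), ("text", "a2"),
    ("headline", "B"), ("text", "b1"),
]
for flow in split_into_flows(content):
    print(flow["headline"], flow["texts"])
```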
After the flows are created, the font size for each flow is adjusted across text placeholders belonging to this flow, therefore completely filling up every text placeholder. Since this adjustment is performed for a whole series of placeholders, the font size is changed only very slightly, and is uniform across the whole flow.
It is also possible to avoid text flows altogether, by adjusting each text placeholder separately. However, if text densities change too much between adjacent placeholders, the font sizes will be visibly different, creating an unpleasant effect.
Flow information could also be explicitly defined in the template language, but such an idea is not explored in this paper. However, the template language may easily be extended to support explicit text flows.
6.3 Rendering pictures
To place a picture in a placeholder, one of the picture’s dimensions (width or height) is scaled to that of the placeholder, and the other dimension is set so that the picture’s aspect ratio is preserved and no part of it is clipped. Automatic cropping of the picture is not attempted, although it would be possible to integrate such a feature.
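The scaling rule amounts to choosing the smaller of the two scale factors, so that one dimension matches the placeholder and the other stays inside it. A minimal sketch (the function name is hypothetical, and positioning inside the placeholder is left out):

```python
def fit_picture(pic_w, pic_h, box_w, box_h):
    """Scale a picture to fit a placeholder, preserving its aspect ratio.

    Returns the scaled (width, height); nothing is cropped.
    """
    scale = min(box_w / pic_w, box_h / pic_h)
    return pic_w * scale, pic_h * scale


print(fit_picture(400, 200, 100, 100))
```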
7 Results and discussion
This section presents results produced by the algorithm using empirical data and InDesign to generate ready-to-print documents.
In Sect. 7.1, the method is compared to a simpler approach, where quantitative measures show that the pagination algorithm scores significantly better. Consequently, possibly longer running times are justifiable if better-quality documents are required. Section 7.2 discusses the performance of the pagination algorithm, comparing it to a simpler but efficient method.
A qualitative evaluation is presented in Sect. 7.3, where output documents are compared to others generated with a simpler pagination approach, as well as a comparison between the proposed method and a real-world magazine, although some limitations must be considered.
7.1 Minimization of worst error and density variation
1. The input is the same as before: a content sequence and a set of templates;
2. At each step, every template is tested against the current sequence of content, in order to evaluate how much content can be consumed by this template;
3. The template that uses the most content with the least amount of error is selected and the sequence of content is reduced;
4. The algorithm repeats from step 2 until the content sequence is empty.
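The first-fit steps can be sketched as follows. To keep the example self-contained, templates are reduced to plain page capacities and content to element sizes, with the unused fraction of the page as the error; these simplifications and the function name are assumptions, not the setup used in the evaluation.

```python
def first_fit(sizes, capacities):
    """Greedy pagination: fill one page at a time with the best local choice.

    sizes: element sizes in input order; capacities: one number per template.
    Returns a list of (capacity, elements_on_page, error) tuples.
    """
    pages = []
    pos = 0
    while pos < len(sizes):
        best = None  # (elements consumed, negated error, capacity)
        for cap in capacities:
            used, count = 0, 0
            while pos + count < len(sizes) and used + sizes[pos + count] <= cap:
                used += sizes[pos + count]
                count += 1
            if count == 0:
                continue  # this template cannot hold the next element
            candidate = (count, -(cap - used) / cap, cap)
            if best is None or candidate > best:
                best = candidate  # most content first, then least error
        if best is None:
            raise ValueError(f"no template can hold element {pos}")
        count, neg_error, cap = best
        pages.append((cap, sizes[pos:pos + count], -neg_error))
        pos += count
    return pages


for page in first_fit([3, 4, 2, 5, 1, 3], [8, 6]):
    print(page)
```

Because each page is chosen without regard to what remains, the final page simply receives the leftovers, which is the behavior the comparison in this section is meant to expose.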
Given the greedy nature of a first-fit pagination strategy, sometimes there will be no way to fill up a template, leaving empty placeholders in the output document. For the purposes of this evaluation, this is going to be allowed for the first-fit method (when it fails to find an adequate solution) by assigning very large errors (more precisely, $10^6$) to unused placeholders.
As the optimal pagination algorithm has a wider range of available solutions to search than simpler pagination strategies, it is able to select a solution that is more homogeneous (according to the objective function from (1)), where the content is well distributed over the document pages.
7.1.1 Generating test instances
The folha dataset (obtained from a Brazilian newspaper’s RSS feed), which comprises a large number of text articles (mostly short) and pictures;
The lipsum dataset, which is smaller and contains larger, randomly generated text articles.
Table 1 More detailed description of the content datasets [table body not recovered; one column reports the average text length]
7.1.2 Single-instance results
Another important detail in Fig. 12 is that the placement error from the first-fit pagination increases abruptly at the end. This is a well-known tendency for the first-fit method, as it uses the content sequence to fill up pages until there is a small sequence left at the end, causing the last page to be under-filled. This does not happen in the method proposed in this paper, as it guarantees an even distribution of content.
7.1.3 Multiple-instance results
More results are presented for several test instances, using both datasets from Table 1. However, only text objects were used, because the inclusion of headlines and/or pictures would cause the first-fit pagination to fail more often and result in worse choices. Test instances were generated with the number of text articles ranging from 50 to 100, broken into smaller objects of 30 to 250 words each.
The system described in Sect. 3 was implemented using Java. Given real-world time constraints, the pagination algorithm (see Sect. 5) had to be implemented using dynamic programming, for producing output in acceptable running times. The allocation algorithm from Fig. 9 also uses memoization in order to store the best template selections for each different sequence of elements.
Using these optimizations, the worst-case performance of the pagination algorithm is bounded by $O(n^3)$ when the number of pages is unknown and $O(k n^2)$ when a number k of pages is defined, where n is the size of the input sequence. The cost for memoizing template choices depends on the number of templates and the methods for allocating elements to placeholders, described in Sect. 5.2. In the worst case its asymptotic bound is $O(|T|\, n!)$, where |T| is the number of available templates. This is due to the allocation method for pictures, which requires a search for every possible ordering to find the best match in pictures’ aspect ratios. If reordering is not allowed, the bound drops to $O(|T|\, n^2)$. In practice, however, this cost is low because templates are checked for matching before attempting allocation and usually they contain just a few placeholders.
Although the performance degradation in the optimal algorithm can be severe when processing very large sequences of content, we found that the algorithm performs well on typical input sizes. For instance, a magazine comprised of 250 elements (20 % pictures and 80 % text) matched against 500 templates was generated in under a minute, running on an Intel® Core™ 2 Duo 1.86 GHz machine with 2 GB of memory. On the other hand, increasing the number of available templates does not seem to impair performance quickly, according to Fig. 16(b).
7.3 Camera-ready results
Though no user studies were performed in this paper for qualitative assessment, camera-ready PDF files produced by the workflow from this paper are presented below. The documents are a proof-of-concept for the pagination and mapping algorithms, and no further attempts are made to enhance the templates with features such as visual styles, fixed elements on a page, page imposition, and others. Therefore, pages contain only the variable content that has been mapped to them. However, a visual comparison with a real-world example is still useful as it provides basis for future enhancements.
Templates used in this test were not created by graphic designers, but were generated automatically using different configurations of a set of boxes of fixed geometry. Approximately 300 different templates were generated using this method.
It is important to mention that the templates were not generated with aesthetic concerns in mind (i.e. no attempts were made to produce aligned or well distributed containers), but rather to demonstrate how the pagination algorithm produces balanced distribution of content among pages.
7.3.1 A real-world magazine
7.3.2 First-fit comparison
The first-fit algorithm generated 40 pages, and only the first 20 are shown in Fig. 19. This was almost twice the number generated by the optimal method, due to the regions left empty by the first-fit method. As can be seen, this can impair both document readability and aesthetics.
As discussed in Sect. 7.1, due to the poor set of choices from the first-fit method, some placeholders could not be filled and little content has remained in the last page, as this method is unable to attempt a more balanced division of content to pages.
Results indicate that the proposed algorithm is able to generate high-quality solutions in practice. Given that the evaluation was made by comparing the algorithm to a first-fit heuristic, one may point out that an exhaustive method will always win against a simple heuristic. We chose a simple heuristic due to a lack of options, as we are not aware of any other openly available method for performing automatic pagination and template selection for variable content. Aesthetic evaluation would also be of limited value, given that templates are rigid, so the aesthetic measures would be more dependent on the templates’ design than on the content mapping itself.
8 Tuning documents
Fine tuning of text granularity (i.e. breaking input text into smaller elements, such as paragraphs, sentences, words, etc.);
Automatic selection of the optimal number of pages for the publication.
These extensions are discussed in the next sections.
8.1 Changing text granularity
It is not possible to make text objects flow across different pages;
Large text objects may be forced to fit into a small placeholder, resulting in a bad distribution of text densities over the document, damaging the balanced distribution requirement for the proposed algorithm.
Both problems can be circumvented by breaking large text objects into smaller units prior to sending them to the pagination algorithm. This is a feature that can be made transparent to the user, given that the smaller text objects still preserve their input order, and will appear more evenly spread across the pages.
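One simple way to perform this pre-processing is to cut each text object into consecutive chunks of at most a fixed number of words, keeping the original order. The sketch below is only an illustration; the function name and the word-based splitting rule are assumptions, and splitting at paragraph or sentence boundaries would follow the same pattern.

```python
def split_text(text, max_words):
    """Break a text object into ordered chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = split_text("one two three four five", 2)
print(chunks)
```

Smaller values of `max_words` give the pagination algorithm more break points to choose from, at the cost of a longer input sequence and therefore a longer running time.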
Table: Running times for low and high granularities of text objects [table body not recovered]
8.2 The optimal number of pages
The splitting algorithm is recursive: it constructs an optimal pagination of p pages by first solving an optimal pagination of p−1 pages. It is therefore possible to select the optimal number of pages by solving the problem for the longest possible magazine, that is, one with a single element on each page. The solution of this problem includes all shorter versions of it (down to the extreme case of all elements on a single page), so it is only necessary to choose the number of pages that produces the minimal error.
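A bottom-up sketch of this selection is given below: the table is filled for every page count from 1 to n, and the page count with the smallest error for the full sequence is returned. The function names, the tie-breaking rule (prefer fewer pages) and the toy capacity-based error function are assumptions for illustration; the recurrence follows the description in Sect. 5.

```python
import math


def error_table(n, sigma):
    """Fill P[m][p]: the minimal worst-page error for elements 1..m on p pages."""
    P = [[math.inf] * (n + 1) for _ in range(n + 1)]
    for m in range(1, n + 1):
        P[m][1] = sigma(1, m)  # a single page holds everything
    for p in range(2, n + 1):
        P[1][p] = sigma(1, 1)  # only the first element remains
        for m in range(2, n + 1):
            P[m][p] = min(
                max(P[i][p - 1], sigma(i + 1, m)) for i in range(1, m)
            )
    return P


def best_page_count(n, sigma):
    """Pick the page count with minimal error, preferring fewer pages on ties."""
    P = error_table(n, sigma)
    return min(range(1, n + 1), key=lambda p: (P[n][p], p))


# Hypothetical error function: unused fraction of a page of fixed capacity.
SIZES = [3, 4, 2, 5, 1, 3]
CAPACITY = 8


def demo_sigma(i, j):
    used = sum(SIZES[i - 1:j])
    return math.inf if used > CAPACITY else (CAPACITY - used) / CAPACITY


print(best_page_count(len(SIZES), demo_sigma))
```

Filling the whole table is what raises the cost from the fixed-page case to the cubic bound quoted in Sect. 7.2.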
This work presented a new algorithm for the construction of personalized documents using page templates and optimally choosing the best templates to hold the content. In addition, the algorithm can be used to find the best number of pages and the best templates for a given amount of content. Thereafter, a description of the document is sent to a standard tool from the printing industry and graphical quality comparable to high-end publications was achieved. This approach can be easily integrated into workflows that require automatic pagination and mapping of content, such as VDP or self-publishing services.
Although the quality of the solutions cannot be compared to hand-crafted publications, a middle-ground is provided between purely automated solutions and high-quality, non-personalized documents.
The self-publishing scenario described earlier could also benefit from an automated approach. For instance, authors with no experience in graphic design could simply submit their content and select a “style” for their magazine (i.e. different sets of templates), leaving the actual layout production to the publishing service.
The contribution presented in this work is the core of a larger workflow, of which we presented only a simplified model, leaving room for several possible improvements. For example, it is not possible to toggle on or off optional element placeholders or convert between elements (placing a picture on a text placeholder if necessary, for example). Moreover, the template selection method searches for multiple combinations of picture placement on a page, which can be costly for the generation of picture-driven documents such as photo albums, and could be disallowed for a faster algorithm. Images could also have scaling constraints added to the scoring function from (3) (Sect. 5.2). For example, busy pictures would only be considered for placement in larger containers. This would result in a more efficient method with better results as well.
To handle these issues, the use of adaptive grid-based templates, as suggested by Lin and Jacobs et al., could be an alternative. However, the difficulty in automatically specifying relations between elements in adaptive templates would hinder flexibility in automated workflows. We believe that in the near future the construction of personalized publications will be more content-driven, so devising a set of complex rules for broad-scope content selection would be a challenge.
Regarding the pagination algorithm, while we do not consider line-breaking and justification issues, the decoupling between rendering and layout evaluation is cleaner and encourages re-usability of the method. What is needed today is not a complete monolithic system, but rather easy integration with other personalized workflows, to reach a larger scope in web-driven publishing. RSS delivery and online photo albums are good examples of possible integrations.
Finally, one interesting improvement that could be made to the pagination algorithm is the handling of solutions with too many repeated instances of the same templates, which usually result in a monotonous publication. A simple mechanism to count the number of times a template has been used and penalize it in the error function would be sufficient to allow for more variability on the appearance of the document, but further investigation is still necessary.
This work was carried out in cooperation with Hewlett-Packard Brasil Ltda. using incentives of the Brazilian Informatics Law (Law no. 8.248 of 1991).
- 1. Podi, Personalized print markup language (PPML) 2.2 (2008). http://www.podi.org/
- 2. Desktop publishing software: Adobe InDesign CS4 (2010). http://www.adobe.com/products/indesign
- 3. Photo books: create a wide variety of photo books using your favorite photos from Snapfish (2010). http://www2.snapfish.com/photobookcategory/COBRAND_NAME=snapfish
- 5. Cohn R (1993) Portable document format reference manual. Addison-Wesley Longman Publishing Co, Inc, Boston
- 11. Goldenberg E (2002) Automatic layout of variable-content print data. Master’s thesis, School of Cognitive & Computing Sciences, University of Sussex, Brighton, UK. http://www.hpl.hp.com/techreports/2002/HPL-2002-286.html
- 15. Johari R, Marks J, Partovi A, Shieber S (1997) Automatic yellow-pages pagination and layout. http://citeseer.ist.psu.edu/johari97automatic.html
- 18. Morrison M, Brownell D, Boumphrey F (1999) XML unleashed. Sams, Indianapolis
- 20. Plass MF (1981) Optimal pagination techniques for automatic typesetting systems. PhD thesis, Stanford, CA, USA
- 24. Skiena SS (1998) The algorithm design manual. Springer, New York