
1 Introduction

Image cropping is an important step to improve the aesthetic quality of an image. It can be used in many applications such as efficient image transmission, image retargeting, or photo collages [1]. When choosing the “best” image cropping method, it is important to consider on which datasets the method is validated and possibly trained. Computer Vision (CV) researchers benchmark algorithms using publicly available datasets [11]. This practice enables quantitative performance evaluation and comparison between algorithms, and it helps new researchers quickly get a clear view of the state-of-the-art performance. The availability of benchmark datasets is increasing in all CV domains, ranging from object tracking [15] to object recognition [7]. However, while much effort has been made to increase the number of available datasets, little effort has been made to assess the quality of the existing ones. Quality is related to two aspects: (i) how representative the data are of the domain; (ii) how consistent the ground-truth is with the modeled problem.

In this paper we provide a thorough analysis of the datasets that are commonly used to benchmark autocropping algorithms. In the last 10 years, several benchmark datasets have been proposed by the scientific community, together with measures that rank algorithms in terms of the match between the ground-truth sub-region(s) and the one(s) selected by the algorithm [2, 6, 21, 22]. Each dataset usually consists of a set of images and the corresponding cropped sub-region(s) selected by human subjects. In order to evaluate the characteristics and effectiveness of the datasets in the literature, in this paper we analyze the most used datasets in terms of: (1) statistics of the ground-truth, namely position, size and aspect-ratio of the candidate views; (2) diversity and coverage of each candidate view with respect to the original image; (3) performance evaluation and comparison with the state of the art of a dummy solution that crops regions with area ranging from 100% to 10% of the image area; (4) semantic analysis in terms of the number of times a semantic concept is preserved in the ground-truth sub-regions with respect to the original images.

2 Related Works

2.1 Datasets for Image Cropping Assessment

Many databases for the evaluation of image autocropping methods have been proposed in the literature. Table 1 summarizes the available datasets. For each dataset we report the number of images it contains, the number of views per image, i.e. the number of crops available for each image, the source of the images, how the crops have been determined (either by human annotation or by an automatic procedure), who validated the crops, i.e. whether the human subjects were experts in the field or not, whether the different views are ranked by preference, and finally the reference where the dataset was first presented. Briefly, the characteristics of the five datasets in Table 1 are the following.

Table 1. Comparison of publicly available image cropping databases.

Comparative Photo Composition Database. The Comparative Photo Composition (CPC) database contains 10,797 images [21]. For each image, 24 candidate views with 4 standard aspect-ratios have been pooled from candidates automatically generated by existing re-composition and cropping algorithms. The candidate views have then been ranked by 6 Amazon Mechanical Turk (AMT) workers. The source of the images is quite diverse: it is a combination of images taken from different benchmark datasets in the literature, namely AVA [16], MS-COCO [13], AADB [10] and the Places dataset [24]. Most of the images contain two or more principal objects.

CUHK Image Cropping Database. The CUHK Image Cropping Database (CUHK-ICD) [22] is a collection of 950 images gathered from the CUHKPQ dataset [18], an existing image quality assessment dataset. It contains seven classes of images, i.e. animal, architecture, human, landscape, night, plant and static, and the images are of varying aesthetic quality. For each image, a cropped region is annotated by each of three different professional photographers.

Flickr Cropping Database. The Flickr Cropping Database (FCDB) contains 1,743 non-iconic images gathered from Flickr [2]. The cropping annotation for each image derives from the choices of four AMT workers who evaluated several manually drawn candidate views. 348 of the 1,743 images are used as the test set, and this is the cardinality reported for this dataset in Table 1. Since there are no multiple views for each image, and thus no ranking of different crops, the “Ranking” attribute in the table is set to “No” for this dataset.

FLMS Database. The FLMS database consists of 500 images crawled from Flickr [6]. These images have been selected for their imperfect composition and have different contents. Each image is cropped by 10 expert users on AMT who passed a strict qualification test. There is no ranking of the views. Each view is considered separately. No further details are provided in [6] about this dataset.

eXPert View Database. The eXPert View (XPView) database is a collection of 992 images with dense compositions [21]. This dataset has been created by the same authors of the CPC dataset in order to test their method on another, unrelated dataset. The XPView images come from different sources, specifically MS-COCO [13], the FCDB and CUHK-ICD datasets, and other, unspecified, sources. The candidate views have been generated as already described for the CPC dataset, but with 8 different aspect-ratios. In this case, the candidate views are annotated by three experts, and a ranking of the views is provided. From our analysis of the dataset, each image has up to 23 views.

2.2 Image Cropping Algorithms

The problem of automatic image cropping has traditionally been tackled by designing ad-hoc algorithms based on different visual cues that are considered relevant. Many methods use salient regions to guide the selection of the important portion of the image; an example is [14]. In [4], depending on the image content, different cropping attributes are used, such as faces, skin, saliency and the image category itself. Visual composition, boundary simplicity and content preservation are used in [6] as features to force a cropped image to contain a salient object. Yan et al. [22] propose a method for learning which features are important in a good crop among color, texture, foreground, shape complexity, sharpness, saliency maps, segmented regions, perspective ratio, and prominent lines.

Recently, another category of cropping algorithms has emerged that incorporates aesthetic cues as features to select the best cropping region. For example, a Generative Adversarial Network is used in [5] with a discriminator that attempts to distinguish images of poor and good aesthetic quality. Aesthetic and gradient energy maps are used in [9] to learn a compositional model for the best crop. The View Finding Network (VFN) [3] tries to correctly rank candidate crops according to photographic guidelines learned on an aesthetically annotated database. In [19, 20] candidate crops are first generated and then their aesthetics are assessed to produce the cropped image. A similar approach based on aesthetic quality classification is the CNN-based Cascaded Cropping Regression (CCR) method [8]. Li et al. [12] propose an Aesthetic Aware Reinforcement Learning (A2-RL) framework to sequentially search for the best cropping window by applying a set of cropping actions. Finally, a fast View Proposal Net (VPN) is presented in [21], where a teacher network is used to teach the VPN (i.e. the student) to output the correct score rankings for the crops.

Fig. 1. Sample images from the five cropping datasets. Superimposed are the crop regions. For the CPC and the XPView datasets, reddish to greenish colors represent ranked crops from worst to best. (Color figure online)

3 Autocropping Ground-Truth: Candidate Views Analysis

Figure 1 shows some sample images from each dataset with the corresponding candidate views superimposed. For the CPC and the XPView datasets, the views are ranked, so greenish colors represent the best ranked views while reddish colors represent the worst ranked ones. As can be seen, the views selected for both CPC and XPView have a high degree of variability, with the different views covering most of the original image. For CUHK-ICD, the three views mostly overlap, although in some cases they can be quite different (as in the case of the building and the cake). The single, favorite, candidate view of the FCDB dataset covers the relevant object in the image; given the presence of a single relevant subject, the images themselves seem quite simple to crop. Finally, for the FLMS dataset the ten candidate views are again quite diverse: the bounding boxes overlap only slightly, which could indicate that the ten experts have different personal opinions about image aesthetics and composition. Following this preliminary examination, we next analyze the five datasets and their annotations in detail, and provide some observations on their use for the evaluation of automatic cropping algorithms.

Fig. 2. (Best viewed magnified.) Distribution of candidate views for each considered dataset with respect to 13 aspect-ratios commonly used in digital photography (a). Average error between aspect-ratios of the candidate views and closest standard aspect-ratios for each database (b).

3.1 Diversity and Coverage

For each dataset, we quantitatively analyze several properties of the candidate views. Firstly, we investigate the aspect-ratio of the candidate views to understand how much they differ from common aspect-ratios. We categorize candidate-view aspect-ratios into 13 classes that are common in still camera photography, namely 1:1, 5:4, 4:3, 3:2, 5:3, 16:9, 3:1 and their complementary versions (4:5, 3:4, 2:3, 3:5, 9:16, and 1:3) [23]. Figure 2a shows the distribution of candidate-view aspect-ratios for each database. As can be seen, the majority of candidate views across all the datasets have a 16:9 aspect-ratio. The CPC candidate views are equally distributed among the 1:1, 4:3, 16:9, and 3:4 aspect-ratios, while the other datasets (CUHK-ICD, FCDB, FLMS, and XPView) show a larger variety of candidate-view aspect-ratios. Figure 2b reports the error resulting from the categorization step, that is, the average distance between candidate-view aspect-ratios and the closest standard aspect-ratios for each dataset. The small error for the CPC dataset is explained by the fact that its candidate views were sampled from standard aspect-ratios, as described in Sect. 2.1. The error for the other datasets is higher because their candidate views have been freely chosen by humans.
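As an illustration, the categorization step can be sketched as follows. This is a minimal sketch of the procedure described above, not the original benchmark code: the function names and the use of width/height pairs as input are our own assumptions.

```python
# Minimal sketch: assign each candidate view to the closest of the 13 standard
# photographic aspect-ratios and measure the average categorization error (Fig. 2b).
import numpy as np

STANDARD = {
    "1:1": 1/1, "5:4": 5/4, "4:3": 4/3, "3:2": 3/2, "5:3": 5/3,
    "16:9": 16/9, "3:1": 3/1, "4:5": 4/5, "3:4": 3/4, "2:3": 2/3,
    "3:5": 3/5, "9:16": 9/16, "1:3": 1/3,
}

def categorize(width, height):
    """Return the closest standard class and the absolute error for one view."""
    ratio = width / height
    name = min(STANDARD, key=lambda k: abs(STANDARD[k] - ratio))
    return name, abs(STANDARD[name] - ratio)

def aspect_ratio_statistics(views):
    """views: iterable of (width, height) tuples for all candidate views."""
    labels, errors = zip(*(categorize(w, h) for w, h in views))
    distribution = {k: labels.count(k) / len(labels) for k in STANDARD}  # Fig. 2a
    return distribution, float(np.mean(errors))                          # Fig. 2b
```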

Secondly, we consider the surface of all candidate windows to estimate their diversity and their coverage with respect to the surface of the whole image. We rescale the images and the corresponding candidate views to the same fixed size; the value of the pixel \(o_{ij}\) of the heatmap, which represents the probability of that pixel being part of a candidate view, is then obtained as follows:

$$\begin{aligned} o_{ij}=\frac{1}{N}\sum _{n=1}^{N}\alpha (x_{ij}) \qquad \alpha = \begin{cases} val, &{} \mathrm {if}\ x_{ij} \in W_n \\ 0, &{} \mathrm {otherwise} \end{cases} \end{aligned}$$
(1)

where N is the total number of candidate views over all dataset samples, \(W_n\) is the n-th candidate view, and val is equal to 1 for the datasets that do not provide a rank of the candidate views, namely CUHK-ICD, FCDB, and FLMS. For the CPC and XPView datasets, whose candidate views do not all have the same relevance, val is equal to \(w_n\), the rank of the n-th candidate view normalized in the interval [0,1]. In Fig. 3 we display the heatmaps for all the datasets. The heatmaps are normalized in the range [0,1], where a pixel value close to 1 means that there is a high probability that the corresponding pixel belongs to a candidate view. The high energy in the center of the heatmaps for all the datasets shows that many candidate views crop the central region of the image. Moreover, the energy is very high over almost the entire surface of the CUHK-ICD images, while it is lower at the edges of the other datasets, in particular those of CPC and XPView. This means that most of the CUHK-ICD candidate views cover almost the whole image, while the CPC and XPView candidate views are very different from each other and focus on image regions much smaller than the entire surface. These qualitative results are confirmed by a quantitative analysis. Precisely, we estimate the average percentage of the image area covered by the candidate views. The values obtained for the three photographers P1–P3 of the CUHK-ICD dataset are, respectively, 82.07 ± 14.74, 82.69 ± 17.89, and 80.49 ± 16.76. For the FCDB dataset the coverage is 65.55 ± 16.64, and for the candidate views of the FLMS dataset it is 58.59 ± 17.41. Finally, the CPC and XPView datasets have similar statistics, equal to 41.82 ± 16.95 and 43.68 ± 17.07, respectively.
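A possible implementation of the heatmap of Eq. 1 is sketched below. The 256-pixel grid size, the field names and the (x, y, w, h, weight) view format are our own assumptions; the weight is 1 for the unranked datasets and the normalized rank for CPC and XPView, as explained above.

```python
import numpy as np

def coverage_heatmap(samples, size=256):
    """samples: list of dicts with 'img_w', 'img_h' and 'views', where each
    view is (x, y, w, h, weight) in original image coordinates."""
    heat = np.zeros((size, size), dtype=np.float64)
    n_views = 0
    for s in samples:
        sx, sy = size / s["img_w"], size / s["img_h"]
        for (x, y, w, h, weight) in s["views"]:
            x0, y0 = int(round(x * sx)), int(round(y * sy))
            x1, y1 = int(round((x + w) * sx)), int(round((y + h) * sy))
            heat[y0:y1, x0:x1] += weight     # alpha = val inside W_n, 0 elsewhere
            n_views += 1
    heat /= n_views                          # the 1/N factor of Eq. 1
    return heat / heat.max()                 # normalized to [0, 1] as in Fig. 3
```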

Fig. 3. Heatmaps showing the spatial coverage of all candidate views for each dataset.

Finally, we analyze the semantics of the images and how cropping them to the candidate-view sub-regions alters them. To this end, we exploit the Hybrid-CNN [24], a CNN trained on 3.5 million images for 1,183 categories, obtained by merging the scene categories of the Places database [24] and the object categories of ImageNet [17]. The table in Fig. 4a reports the percentage of times that the semantic concept of an image is preserved in the crops obtained by applying the candidate views. Figure 4b presents the distribution of semantic concepts over the images of each dataset. The distributions are spread across all the categories, with some peaks corresponding to landscape concepts such as promontory, lakeside, and valley.
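The preservation check can be sketched as follows. The paper relies on the Hybrid-CNN [24]; in this sketch any classifier returning a top-1 label can be plugged in through the `classify` callable, and the sample format is our own assumption.

```python
def semantic_preservation_rate(samples, classify):
    """samples: iterable of (image, crops) pairs, with image a PIL image and
    crops a list of (x, y, w, h) candidate views; classify(img) returns the
    top-1 concept label predicted by the chosen network (Hybrid-CNN in the paper)."""
    preserved, total = 0, 0
    for image, crops in samples:
        original_label = classify(image)
        for (x, y, w, h) in crops:
            crop_label = classify(image.crop((x, y, x + w, y + h)))
            preserved += int(crop_label == original_label)
            total += 1
    return 100.0 * preserved / total  # percentage of preserved concepts (Fig. 4a)
```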

Fig. 4. Semantic analysis. Number of times a semantic concept is preserved in the candidate views for each dataset (a). Distribution over the 1,365 semantic concepts of the Hybrid-CNN for the dataset images (b).

3.2 Performance Evaluation of Autocropping Algorithms

We establish a baseline by considering a dummy solution consisting of crops whose surface covers decreasing percentages of the image area. More in detail, we estimate the performance of central crops retaining from 100% to 10% of the image area, and of random crops retaining the same percentages, averaged over 100 iterations. The evaluation metrics commonly used for cropping performance comparison are the intersection-over-union (IoU) and the boundary displacement error (BDE).
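The two dummy baselines can be sketched as follows; this is our own illustration under the assumption that a crop retains the original aspect-ratio, so each side is scaled by the square root of the desired area fraction.

```python
import random

SCALES = [s / 10 for s in range(10, 0, -1)]   # 100% ... 10% of the image area

def central_crop(img_w, img_h, area_fraction):
    """Crop centered in the image and retaining `area_fraction` of its area."""
    side = area_fraction ** 0.5               # per-side scale factor
    w, h = img_w * side, img_h * side
    return ((img_w - w) / 2, (img_h - h) / 2, w, h)   # (x, y, w, h)

def random_crop(img_w, img_h, area_fraction, rng=random):
    """Randomly placed crop retaining `area_fraction` of the image area."""
    side = area_fraction ** 0.5
    w, h = img_w * side, img_h * side
    return (rng.uniform(0, img_w - w), rng.uniform(0, img_h - h), w, h)
```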

Intersection-over-Union (IoU). The intersection-over-union (IoU), also referred to as the Jaccard index, quantifies the overlap between the ground-truth candidate view and the predicted crop. Given the ground-truth candidate view \(W_{\mathrm {GT}}\) and the predicted crop W, and interpreting intersection and union as areas, the IoU is defined as follows:

$$\begin{aligned} IoU = (W_{\mathrm {GT}} \cap W) / (W_{\mathrm {GT}} \cup W) \end{aligned}$$
(2)

Boundary Displacement Error (BDE). The boundary displacement error computes the distance between the four edges of the ground-truth candidate view and the corresponding edges of the predicted crop. Denoting the four edges of the ground-truth candidate view and of the predicted view as \(B_{\mathrm {GT}}(l)\), \(B_{\mathrm {GT}}(r)\), \(B_{\mathrm {GT}}(t)\), \(B_{\mathrm {GT}}(b)\), and B(l), B(r), B(t), B(b) respectively, the BDE is estimated as follows:

$$\begin{aligned} BDE = \sum _{j=\{l,r,t,b\}}{|B_{\mathrm {GT}}(j)-B(j)|/4}, \end{aligned}$$
(3)
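For reference, Eqs. 2 and 3 can be computed on axis-aligned crops given as (x, y, w, h) as sketched below; normalizing the edge coordinates by the image size before computing the BDE is our own assumption, since the text does not state the convention explicitly.

```python
def iou(gt, pred):
    """Intersection-over-union (Eq. 2) between two (x, y, w, h) boxes."""
    gx, gy, gw, gh = gt
    px, py, pw, ph = pred
    iw = max(0.0, min(gx + gw, px + pw) - max(gx, px))
    ih = max(0.0, min(gy + gh, py + ph) - max(gy, py))
    inter = iw * ih
    union = gw * gh + pw * ph - inter
    return inter / union if union > 0 else 0.0

def bde(gt, pred, img_w, img_h):
    """Boundary displacement error (Eq. 3), with edges normalized by image size."""
    gx, gy, gw, gh = gt
    px, py, pw, ph = pred
    g = (gx / img_w, (gx + gw) / img_w, gy / img_h, (gy + gh) / img_h)  # l, r, t, b
    p = (px / img_w, (px + pw) / img_w, py / img_h, (py + ph) / img_h)
    return sum(abs(ge - pe) for ge, pe in zip(g, p)) / 4.0
```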
Fig. 5. IoU and BDE obtained by comparing ground-truth candidate views with dummy central crops covering image areas from 100% to 10%.

Table 2. Comparison in terms of IoU and BDE for the CUHK-ICD, FCDB and FLMS datasets. The \(\varDelta Best\) row reports the absolute error between the result of the best method in the state of the art and the best value among the various scales of the dummy solution.

Results. We collect performance figures for all the datasets at varying crop scales for both the central and the random dummy solutions: the two achieve performance that is not significantly different. Figure 5 shows the IoU and the BDE at varying central crop scales for each database. The performance on the CUHK-ICD dataset is initially very high, both in terms of IoU and BDE, and declines almost linearly as the surface covered by the dummy crops decreases. The results for FLMS and FCDB are roughly linear down to scale 0.5, after which they drop. Finally, performance is stably low for the CPC and XPView datasets. Table 2 compares, in terms of IoU and BDE, several state-of-the-art algorithms with the dummy solution. We include CUHK-ICD, FCDB and FLMS because they are the datasets most commonly used for benchmarking cropping algorithms. From the table it is clear that CCR [8] and AIC [20] are the best methods in terms of IoU and BDE for the CUHK-ICD dataset. However, on the same dataset the best dummy solution achieves an IoU that is about 2.7% lower and a BDE that is about 1.5% lower than the best in the state of the art. This behavior can be explained by looking at the heatmap of the CUHK-ICD dataset (see Fig. 3), which shows that the ground-truth candidate views cover almost the entire image. The VEN [21] algorithm is the best on the FCDB and FLMS datasets. On FCDB, the best dummy solution achieves a performance that is 7.4% and 1.7% lower than the best in the state of the art, in terms of IoU and BDE respectively. On FLMS, the best dummy solution is 22.1% and 6.3% lower than the best in the state of the art. The FLMS dataset contains ground-truth candidate views with aspect-ratios that are quite different from the common ones (see Fig. 2b). Moreover, Fig. 3 shows that its candidate views do not cover the entire image.

4 Conclusions

In this work we conducted a detailed analysis of the main datasets in the state of the art for the evaluation of autocropping methods, in terms of statistics, diversity and coverage of the ground-truth crops. The results show that each dataset models the cropping problem differently. Moreover, the CPC and XPView datasets consist of very diverse candidate views, and most of the datasets contain candidate views whose aspect-ratios are not standard. The comparison between state-of-the-art and dummy solutions shows that, in the case of the CUHK-ICD dataset, results comparable with the best solution in the state of the art can be reached by a dummy autocropping strategy that does not crop anything. The results obtained on FCDB and FLMS show that these datasets are more challenging and diverse, with the dummy solutions performing worse than state-of-the-art algorithms, which makes them more suitable for the evaluation of autocropping algorithms.