Learning to assess visual aesthetics of food images

Distinguishing aesthetically pleasing food photos from others is an important visual analysis task for social media and ranking systems related to food. Nevertheless, aesthetic assessment of food images remains a challenging and relatively unexplored task, largely due to the lack of related food image datasets and practical knowledge. Thus, we present the Gourmet Photography Dataset (GPD), the first large-scale dataset for aesthetic assessment of food photos. It contains 24,000 images with corresponding binary aesthetic labels, covering a large variety of foods and scenes. We also provide a non-stationary regularization method to combat over-fitting and enhance the ability of tuned models to generalize. Quantitative results from extensive experiments, including a generalization ability test, verify that neural networks trained on the GPD achieve comparable performance to human experts on the task of aesthetic assessment. We reveal several valuable findings to support further research and applications related to visual aesthetic analysis of food images. To encourage further research, we have made the GPD publicly available at https://github.com/Openning07/GPA.


Introduction
Food is one of the most fundamental entities in our daily life. A great food photograph can convey feelings of warmth, awaken fond memories, conjure up fantasies, or just simply make you hungry [1]. It can also draw crowds flocking to a new restaurant or boost the sales of a food magazine. Thus, the ability to assess the aesthetic quality of food images plays an important role in various applications, such as food photo recommendation (see Fig. 1(a)), food photography assistance, and enhancement (Fig. 1(b)).
Human beings can easily gauge the visual aesthetics of food photos. However, it remains challenging for artificially intelligent agents to do so. During the past two decades, many researchers have considered various related fields, such as image aesthetic assessment [2][3][4] and food image analysis [5][6][7]. Some have already explored aesthetic assessment of food images [8], but they resorted to hand-crafted visual features and did not perform quantitative studies on a large-scale dataset.
To endow intelligent agents with the ability to assess food image aesthetics, there are two major challenges to solve. Firstly, while there are some aesthetic image datasets [2,9] and food categorization benchmarks [5,6,10], no dataset is available for learning or evaluation of food image aesthetic assessment. Without a reliable dataset, we cannot investigate the topic quantitatively and provide scientific observations or insights. Secondly, prior knowledge is lacking in how to perform the task effectively. Any model needs to be regularized so that it generalizes well to unseen food images, from other sources than any training data.
To address the above two challenges, in this paper, we present the Gourmet Photography Dataset (GPD), containing 24,000 food images with corresponding binary aesthetic annotations. We have conducted a series of experiments with popular learning mechanisms for visual analysis tasks to verify the annotation quality of the GPD. We have also devised a non-stationary, effective regularization method, adaptive smoothing regularization (ASR), to combat over-fitting and thereby provide better generalization and performance. We have quantitatively assessed the generalization abilities of optimized models on unseen food images. Extensive experiments in Section 5 demonstrate that the GPD provides practical help in tuning CNN models to capture the visual patterns that matter for food aesthetics, and thus to realize effective food photo aesthetic assessment. The proposed regularization strategy outperforms several common counterparts in the task of image aesthetic assessment. All these findings encourage further development in related applications of food image aesthetic assessment.
In summary, our contributions are as follows:
• the GPD, the first large-scale dataset to support research into aesthetic visual assessment of food images;
• a simple yet effective approach to properly regularize neural networks for enhanced generalization ability and better performance;
• a system with promising performance for the task of food image aesthetic assessment, which demonstrates good generalization ability.
We also provide practical knowledge for further research.
A preliminary extended abstract of this work appeared at SIGGRAPH Asia 2018 [11]. The code and dataset can be downloaded from https://github.com/Openning07/GPA.

Image aesthetic assessment
The goal of image aesthetic assessment is to gauge the aesthetics of input images; it has been extensively studied over the past decade. Early works on image aesthetic assessment resort to hand-crafted features [2,12,13]. Recently, thanks to large-scale datasets [2,9,14], convolutional neural networks (CNNs) with effective learning mechanisms have been able to outperform their hand-crafted counterparts. Advanced methods have been developed, such as order-less multi-patch aggregation [15], aesthetic attribute graphs with adaptive patch selection [3], an Earth mover's distance based loss function [16], an attention-based learning scheme [17], visual feature aggregation [18], a semi-supervised deep active learning-based model [19], and multi-level pooling [4]. Progress in this topic has encouraged many aesthetic-aware applications (e.g., see Table 1). In this paper, we investigate the aesthetic assessment of food photos, which is an under-developed specific image domain with huge practical commercial value.

Food image analysis
There is an increasing amount of research into food images, because of their high value in commercial visual marketing. Many advanced methods and benchmarks have been proposed, e.g., for food categorization [6,10,25], recipe retrieval [7], and calorie estimation [26]. In this paper, unlike previous literature, we attempt to investigate the possibility of aesthetic visual analysis of food images, with great potential value in visual commercial marketing.

Regularization method
In machine learning, regularization is intended to diminish generalization error, instead of training error. Developing effective regularization methods has always been a major research topic. Effective approaches include: softmax with temperature [27], label smoothing regularization (LSR) [28], dropout [29], data augmentation [30], etc. In this work, we propose an effective regularization strategy for image aesthetic assessment.

Our dataset
There were three steps to establishing the Gourmet Photography Dataset: food image collection, aesthetic label annotation, and inter-human agreement.

Food image collection
To learn how to assess food images as aesthetically positive or negative, we should aim for high variety during image collection, with respect to categories, viewpoint, lighting conditions, and layout. We collected food photos from the Internet and existing food categorization benchmarks. Firstly, we downloaded food images from four popular online communities: Flickr, Pinterest, 500px, and Pexels, using various food keywords (e.g., cakes, drinks, seafood) and regional cuisine indicators (e.g., Chinese, French, Mexican). We also retrieved images from various food categorization datasets [5,10] in a class-balanced manner to enrich data complexity. In this way, we collected a rich variety of images with varying complexity. After collection, we removed irrelevant instances, such as duplicated images, collages, and photos with observable artificial additions. We also conducted additional pre-processing operations to provide a meaningful training signal, such as removing unnecessary image borders and rotation calibration.

Aesthetic label annotation
Following existing literature [2][3][4][15], we treat visual aesthetic assessment of food photos as a binary image classification problem. For each image-label pair $(I_i, \hat{y}_i)$, $\hat{y}_i$ is the aesthetic label for image $I_i$, where $\hat{y}_i \in \{0, 1\}$ denotes negative or positive. Figure 3 illustrates the annotation procedure used to provide binary aesthetic labels for images; Amazon's Mechanical Turk (AMT) was used. Workers were asked to judge whether a displayed image looked aesthetically pleasing. Some food images are aesthetically ambiguous, leading annotators to spend a long time producing a low-confidence answer. To mitigate this issue and ease their anxiety over such images, workers were allowed to skip images for which they could not confidently provide answers. Ensuring high-confidence answers is crucial to limiting time consumption and guaranteeing that labels contain meaningful cues, such as personal or cultural preferences, with high recall ratio. Images that were skipped three times or labeled validly were not reissued to further workers. Moreover, each worker was allowed to annotate 3000 images at most, to avoid allowing a few annotators to dominate the aesthetic perception of the dataset. Overall, 57 workers participated in the annotation procedure. We obtained 29,042 valid image-aesthetic label pairs, with 2647 photos skipped.

Inter-human agreement
To ensure high-quality aesthetic labels, we removed controversial labels where possible. Eight additional expert photographers with good aesthetic taste were invited to re-check the collected labels. For each image-label pair, they could agree or disagree with the label, based on the tips (e.g., lighting, colour, quality) in Fig. 2. If more than four experts agreed, the annotation was kept; otherwise, the label was regarded as ambiguous and discarded. During this process, 5042 instances were eliminated due to potential controversy. Most of those annotations come from a few AMT workers, who were perhaps unqualified for the task.
The results form the GPD: 24,000 food images with corresponding aesthetic labels, 13,088 positive and 10,912 negative. We show some instances in Fig. 2. For simplicity, in the following experiments, we randomly divided the GPD into two partitions: 21,600 (11,779 positive/9821 negative) images for training and the remainder for testing.
Fig. 3 Our annotation process to collect aesthetic annotations, which allows workers to skip assessing photos of ambiguous visual aesthetics for high confidence answers.
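The expert re-check described above can be sketched as a simple vote filter. This is a minimal illustration, not the authors' tooling: the function names, image identifiers, and toy votes are hypothetical, and "more than four" of the eight experts is encoded as at least five agreeing votes.

```python
def keep_label(expert_votes, min_agree=5):
    """Return True if enough experts agree with the crowd label.

    expert_votes: list of booleans, one per expert (True = agrees).
    min_agree=5 encodes "more than four" of the eight experts.
    """
    return sum(expert_votes) >= min_agree


def filter_pairs(pairs, votes_by_image):
    """Keep only (image_id, label) pairs that pass the expert check."""
    return [(img, lab) for img, lab in pairs if keep_label(votes_by_image[img])]


# Toy example with hypothetical image ids and votes.
pairs = [("img_a", 1), ("img_b", 0), ("img_c", 1)]
votes = {
    "img_a": [True] * 6 + [False] * 2,  # 6 of 8 agree: kept
    "img_b": [True] * 4 + [False] * 4,  # exactly 4 agree: discarded
    "img_c": [True] * 8,                # unanimous: kept
}
kept = filter_pairs(pairs, votes)

# The counts reported in the text are mutually consistent.
assert 29042 - 5042 == 24000      # valid pairs minus eliminated instances
assert 13088 + 10912 == 24000     # positive + negative labels
assert 11779 + 9821 == 21600      # training partition
```

The three arithmetic checks only restate figures given in the text; the remaining 2400 images form the testing partition.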

Given a training set $\{(I_i, \hat{y}_i)\}_{i=1}^{N}$ of $N$ image-aesthetic label pairs, we cast the aesthetic assessment as a binary classification problem and apply the cross-entropy loss:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(\tilde{y}_i = \hat{y}_i \mid I_i, \theta) \qquad (1)$$

where $\tilde{y}_i$ is the label predicted by the model given an input image patch $I_i$, and $\theta$ denotes the trainable parameters of the model $\Pr(\cdot \mid \cdot, \theta)$, which contains the feature extractor $f(\cdot, \theta_E)$ and the classifier $g(\cdot, \theta_C)$. We apply the softmax function $\sigma(z_i): \mathbb{R}^C \to (0, 1)^C$ to calculate the confidences for the aesthetically positive class and the negative one:

$$\sigma(z_i)_c = \frac{\exp(z_{i,c}/\tau)}{\sum_{j=1}^{C} \exp(z_{i,j}/\tau)} \qquad (2)$$

where $z_i$ is a $C$-dimensional input vector ($C = 2$ in our case), $c$ indexes the classes, and $\tau$ is the temperature parameter [27], which controls the shape of the output probability distribution over the classes (usually set to 1). When $\tau > 1$, the margin between the maximum logit and the others for each $z_i$ is diminished, so the maximum confidence of the aesthetic assessment is reduced accordingly.
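The effect of the temperature can be checked numerically. The following is a minimal sketch in plain Python, assuming the standard temperature-scaled softmax and a per-sample cross-entropy; the function names and the toy logits are illustrative and do not come from the released code.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target, tau=1.0):
    """Per-sample cross-entropy for an integer target class."""
    return -math.log(softmax(logits, tau)[target])


logits = [2.0, 0.0]            # toy logits for the two aesthetic classes
p_tau1 = softmax(logits, 1.0)  # default temperature
p_tau2 = softmax(logits, 2.0)  # tau > 1 flattens the distribution

loss_tau1 = cross_entropy(logits, target=0, tau=1.0)
loss_tau2 = cross_entropy(logits, target=0, tau=2.0)
```

With these toy logits the maximum confidence drops when τ is raised from 1 to 2, and the loss on the target class grows correspondingly.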

Motivation
As noted, we treat image aesthetic assessment as a binary classification problem. Arguably, the image-level aesthetic label cannot indicate differences in visual aesthetics for different local patches cropped from a single image. Without a proper strategy, we might overly penalize the negative class on patches from aesthetically positive images (see Fig. 4(right)).
In other words, if we train the model naively, the model may assign full probability to the target aesthetic class for each input instance, leading to over-confidence. Given the distribution of aesthetic scores of images in the AVA benchmark [2] (Fig. 4(left)), it is improper to require the model to output a prediction with 100% confidence for every input image. It is also revealed in Ref. [31] that CNNs with ReLU activation functions always yield high-confidence predictions far away from the training data. To ensure the generalization ability of optimized models, we need to handle the over-confidence issue properly. Inspired by Laplace smoothing [32], which favors highlighting more certain examples while avoiding overly penalizing the others, we mitigate the over-confidence issue by adaptively smoothing the shape of the output probability distribution. The intuitive motivation is that we diminish the output values of target classes by introducing K smoothing vectors in the last fully connected layer.

Adaptive smoothing regularization
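We introduce K vectors in the last fully connected layer to implement ASR, i.e., $\theta_A$. Figure 5 exhibits the core idea of ASR. Because the K introduced vectors contribute additional terms to the softmax normalization, the confidence of binary visual aesthetics will reduce. In this way, we mitigate the over-confidence and promote generalization ability. Putting these ideas together, the ASR method may be stated as

$$\Pr(\tilde{y}_i = c \mid I_i, \theta) = \frac{\exp(z_{i,c}/\tau)}{\sum_{j=1}^{C} \exp(z_{i,j}/\tau) + \sum_{k=1}^{K} \exp(a_{i,k}/\tau)} \qquad (3)$$

where $z_{i,c}$ is the logit of aesthetic class $c$ produced by $\theta_C$ and $a_{i,k}$ is the logit produced by the $k$-th introduced vector in $\theta_A$. The corresponding derivatives of the per-sample loss $L_i = -\log \Pr(\tilde{y}_i = \hat{y}_i \mid I_i, \theta)$ are

$$\frac{\partial L_i}{\partial z_{i,c}} = \frac{1}{\tau}\left(\Pr(\tilde{y}_i = c \mid I_i, \theta) - \mathbb{1}[c = \hat{y}_i]\right), \qquad \frac{\partial L_i}{\partial a_{i,k}} = \frac{1}{\tau} \cdot \frac{\exp(a_{i,k}/\tau)}{\sum_{j=1}^{C} \exp(z_{i,j}/\tau) + \sum_{k'=1}^{K} \exp(a_{i,k'}/\tau)} \qquad (4)$$

We further note that the K introduced vectors can also help maintain the pace of optimization, as lower target confidence always means higher error and strengthens the training signal for back-propagation, according to Eq. (4). We find that introducing K vectors is not enough, because their softmax output values decrease quickly as optimization progresses.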
To maintain the smoothing effect, we propose three procedures:
• Randomly select K vectors from the model pre-trained on ImageNet [33];
• Every Freq_comp training epochs (5 in our experiments), we pick the representation vectors of patches whose target class prediction confidence is relatively low (< 0.5), and use their mean vectors to update the K classes;
• We update θ_A incrementally (λ = 0.3 in this paper).
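The smoothing effect of the K introduced vectors can be illustrated with a small numerical sketch. It assumes one reading of the description above, namely that the K vectors simply contribute K extra logits to the softmax normalization with τ = 1; the function names, the toy feature, and the weight values are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def asr_confidences(feature, theta_c, theta_a):
    """Softmax over C class logits plus K smoothing logits (tau = 1).

    Returns (class_probs, smoothing_probs). With an empty theta_a this
    reduces to the ordinary softmax over the C classes.
    """
    logits = [dot(w, feature) for w in theta_c + theta_a]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    c = len(theta_c)
    return probs[:c], probs[c:]


feature = [1.0, 0.5]                  # toy feature from the extractor
theta_c = [[2.0, 0.0], [0.0, 2.0]]    # classifier vectors for the 2 classes
theta_a = [[1.0, 1.0], [0.5, 0.5]]    # K = 2 smoothing vectors (assumed values)

plain, _ = asr_confidences(feature, theta_c, [])
smoothed, extra = asr_confidences(feature, theta_c, theta_a)
```

In this toy case the predicted class is unchanged, but its confidence is lower once the smoothing vectors take part of the probability mass, which is the intended regularization effect.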
We give the main algorithm in Algorithm 1.

Algorithm 1
...
4: Optimize θ_C and θ_A using Eq. (1);
5: Evaluate the target class confidences and extract visual features on the validation partition;
6: Select the feature vectors whose target class confidences are around τ_th; // useful features
7: Conduct DBSCAN clustering on the features to remove outliers and get K centroids for θ_A;
8: Use the centroids to update θ_A to strengthen the regularization effect;
...
Ep_idx += 1;
11: end while
12: Return θ_C.
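The periodic refresh of the smoothing vectors can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the DBSCAN clustering step is replaced by a plain per-class mean, the incremental update is assumed to be the convex combination θ_A ← (1 − λ)·θ_A + λ·centroid, one smoothing vector is tied to each aesthetic class, and all names and toy values are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def update_smoothing_vectors(theta_a, features, labels, confidences,
                             lam=0.3, conf_th=0.5):
    """Blend each smoothing vector with the mean low-confidence feature.

    theta_a[k] is paired with class k (assumption). Features whose target
    class confidence is below conf_th are averaged per class, and the old
    vector is moved toward that mean with step size lam.
    """
    updated = []
    for k, old in enumerate(theta_a):
        picked = [f for f, y, c in zip(features, labels, confidences)
                  if y == k and c < conf_th]
        if not picked:  # nothing to learn from: keep the old vector
            updated.append(list(old))
            continue
        centre = mean_vector(picked)
        updated.append([(1 - lam) * o + lam * m for o, m in zip(old, centre)])
    return updated


theta_a = [[0.0, 0.0], [1.0, 1.0]]
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [9.0, 9.0]]
labels = [0, 0, 1, 1]
confidences = [0.4, 0.3, 0.45, 0.95]  # the last sample is confident: ignored

new_theta_a = update_smoothing_vectors(theta_a, features, labels, confidences)
```

In a training loop this update would be called once every Freq_comp epochs (5 in the experiments), between the optimization step and the epoch counter increment.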
Discussion. Compared to existing regularization methods, such as LSR [28], confidence penalty, or data augmentation [30,34], the proposed method does not make the transformation-invariant assumption or assign pre-defined values to curtail output confidences. Instead, we apply K introduced vectors to smooth the output space in a non-stationary manner. Moreover, the introduced vectors change flexibly during the training process. Thus, the proposed ASR is more flexible and reasonable.

Implementation details
We apply the SGD algorithm using a batch size of 32, with Nesterov momentum of 0.9 and weight decay of 5e-4. We begin with a learning rate of 1e-3, drop it by a factor of 0.1 after every 10 epochs, and keep it at 1e-5 after 20 epochs. We set K = 2 in Eq. (3) during the experiments, as we believe that one introduced vector per aesthetic class suffices to make the regularization effect work, as we will see in Section 5. Using a single NVIDIA Titan X GPU, the learning process takes about 17 hours to finish all 40 epochs. We implemented the method with TensorFlow.
Without loss of generality, we adopted the 18-layer ResNet [35] (ResNet-18) model as the backbone network, with which we compared the results of different regularization strategies (see the last several rows of Tables 2-4).
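The learning-rate schedule described above can be written as a small step function. This is a sketch assuming zero-based epoch indices; the function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-3, factor=0.1, step=10, floor=1e-5):
    """Step decay: multiply by `factor` every `step` epochs, never below `floor`."""
    lr = base_lr * factor ** (epoch // step)
    return max(lr, floor)


# Learning rate for each of the 40 training epochs (epochs 0..39).
schedule = [learning_rate(e) for e in range(40)]
```

Under this reading, epochs 0 to 9 use 1e-3, epochs 10 to 19 use 1e-4, and all later epochs stay at 1e-5.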

Baseline comparison
To ensure reliable results and a systematic evaluation, we applied several typical vision learning algorithms to the GPD. Their performance, in terms of percentage assessment accuracy, helps us explore the possibility of visual aesthetic assessment on the food images and assess the quality of aesthetic labels in the GPD.

SVM with color
Color information proves to be important in image aesthetic assessment [12].
We encoded color information for images as color histogram features, with 128 bins for RGB color channels. Zero-mean unit-variance normalization was conducted before optimization.
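A color histogram feature of this kind can be sketched in plain Python. The sketch assumes 128 bins per channel (the text does not say whether the 128 bins are per channel or in total) and 8-bit RGB pixels; the function names and the two toy images are hypothetical.

```python
def color_histogram(pixels, bins=128):
    """Concatenated per-channel histogram of (r, g, b) pixels in 0..255.

    Returns 3 * bins values; each channel's block sums to 1.
    """
    hist = [0.0] * (3 * bins)
    count = 0
    for rgb in pixels:
        for channel, value in enumerate(rgb):
            hist[channel * bins + value * bins // 256] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def standardize(rows):
    """Zero-mean, unit-variance normalization of each feature column."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        std = (sum((x - mean) ** 2 for x in col) / len(col)) ** 0.5
        stds.append(std if std > 0 else 1.0)  # constant columns stay at zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


image_a = [(255, 0, 0), (255, 0, 0)]   # two pure-red pixels
image_b = [(0, 0, 255), (0, 255, 0)]   # one blue and one green pixel
features = [color_histogram(image_a), color_histogram(image_b)]
scaled = standardize(features)
```

The standardized rows are what would be passed to the SVM; the normalization statistics would normally be computed on the training partition only.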

SVM with GIST
GIST features [36] are another typical approach used to capture the global content of images. We extracted 512-dimensional gray-scale GIST features with an image size of 256 × 256. Zero-mean-unit-variance normalization was also performed as a preprocessing step to facilitate the following optimization process.

Implementation details
For training, photos were re-scaled with respect to the shortest edge (259 for AlexNet and 256 for the others), and then patches (227 × 227 for AlexNet and 224 × 224 for the others) were randomly cropped. Random horizontal mirroring (with probability 0.5) was conducted for data augmentation. To maximize the performance of each model, we selected training hyperparameters, such as batch size or learning rate for CNNs and cost coefficient for SVM, via cross-validation experiments. For inference, we report and compare the average prediction over 10 patches randomly cropped from input photos.
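The multi-patch inference step can be sketched as follows. This is a minimal illustration: only the crop geometry and the averaging are shown, the scoring function is a stand-in for a trained model, and all names are hypothetical.

```python
import random


def rescaled_size(width, height, shortest=256):
    """Image size after scaling so that the shortest edge equals `shortest`."""
    scale = shortest / min(width, height)
    return (max(shortest, round(width * scale)),
            max(shortest, round(height * scale)))


def random_crop_boxes(width, height, crop=224, n=10, seed=0):
    """Sample n crop boxes (left, top, right, bottom) inside the image."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        left = rng.randint(0, width - crop)
        top = rng.randint(0, height - crop)
        boxes.append((left, top, left + crop, top + crop))
    return boxes


def average_confidence(boxes, score_patch):
    """Average the per-patch confidences returned by score_patch."""
    return sum(score_patch(box) for box in boxes) / len(boxes)


def toy_scorer(box):
    """Stand-in for a trained model: returns a fake positive confidence."""
    return 0.6 if box[0] < 60 else 0.4


w, h = rescaled_size(640, 480)   # a 640 x 480 photo becomes 341 x 256
boxes = random_crop_boxes(w, h)  # 10 random 224 x 224 patches
confidence = average_confidence(boxes, toy_scorer)
```

For AlexNet the same sketch applies with `shortest=259` and `crop=227`.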

Evaluation on GPD
The results of the aforementioned methods on the GPD are listed in Table 2. Several typical aesthetically negative and positive instances are shown in Fig. 6. Figure 7 shows the histogram of confidences predicted by ResNet-18 on the training portion of the GPD. We also experimented with MP_ada [17], a state-of-the-art approach for image aesthetic assessment, using the authors' code from https://github.com/Openning07/MPADA. From the results, we obtained the following findings:
• The scale of the GPD seems to be sufficient to support training learning algorithms. The gaps between training performance and testing results are generally less than 10% in each row of Table 2. It should be noted that we do not adopt any data augmentation or complex training tricks beyond the random cropping and mirroring described above. We also conducted the same experiments with different partitions of data, i.e., 10-fold cross-validation on the whole GPD, and obtained results close to those in Table 2. We may conclude that the results demonstrate the effectiveness of the GPD for assessing visual aesthetics of food photos.
• Our regularization module outperforms other approaches. Comparing the results for ResNet models with different regularization strategies, our proposed ASR works better than other regularization strategies. We further observe that during the inference stage, the confidence values for positive/negative visual aesthetics are always larger than those for the introduced K vectors. This shows that our regularization method works for the task of binary aesthetic visual assessment.
• GPD-supervised CNNs achieve the best testing results amongst the tested learning mechanisms. Further, SVM with VGG features generally outperforms SVM with hand-crafted features. These findings on the effectiveness of CNNs are not new and are consistent with mainstream conclusions from the computer vision community [39]; they make effective image aesthetic assessment of food photos possible. Another interesting observation is that the visual features from VGG-scenes do not work as well as those from VGG-foods or VGG-objects. This demonstrates the importance of object semantics and food semantics in assessing visual aesthetics of food photos.

Generalization ability test
To test the generalization ability of tuned models, we collected 825 unseen food photos from WeChat, one of the largest online communities. We then invited 50 qualified candidates to give their opinions as to whether the photos looked aesthetically pleasing or not. Based on the 41,250 responses, we measured the consistency of the aesthetic assessments from the models with human perception via a consistency score V(·), reported separately for aesthetically positive photos, V(S_pos), and negative ones, V(S_neg). The results are listed in Table 3; we calculate the best / worst results in a greedy manner.
Table 3 The results of the generalization ability test on several approaches, using food photos collected from WeChat, a source different from those of the GPD images.
To better visualize the comparisons, we show some images together with aesthetic assessment results from different methods in Fig. 8. Based on these results, we draw some empirical conclusions as follows:
• Human experts achieve results close to the theoretical best (75.5, 83.9), and significantly better than random. These observations indicate that a good model for image aesthetic assessment on food photos should be able to generalize well, like human experts. They also indicate that the 825 food images with collected responses from reviewers can be used to test model generalization ability.
• GPD-supervised CNNs possess good generalization ability in assessing visual aesthetics of food photos. The aesthetic assessments of food photos from GPD-supervised CNNs are consistent with those of human experts. For positive aesthetics, ResNet-18 with ASR even outperforms human experts in the experiment. This shows that neural networks tuned on the proposed GPD dataset possess good generalization ability in assessing visual aesthetics of food photos. Consequently, these results demonstrate the validity and utility of the GPD and the proposed regularization method in the task of aesthetic visual assessment on food photos (e.g., food image triage or recommendation).
• Negative food visual aesthetics seem to be easier to assess than positive ones. We have more food images with positive aesthetics than ones with negative aesthetics in the GPD, whereas the tuned models consistently achieve better results for negative cases than positive ones. This is supported by the observation that V(S_neg) is generally higher than V(S_pos) in each row of Table 3. These results indicate that, when assessing the visual aesthetics of food photos, people reach consensus on negative visual aesthetics more often than on positive ones, if they are forced to make a judgment. This insight provides some useful guidelines for further developments. For example, we need more training instances of positive visual aesthetics, and we should take personal preferences into consideration when offering certain services related to positive visual aesthetics.

Further investigation into K
To further investigate how the hyper-parameter K in Eq. (3) influences regularization and the final aesthetic assessment accuracy, we conducted additional experiments with different values of K on the GPD. To compute accurate statistics, we conducted 10-fold cross-validation; the results are exhibited in Fig. 9. It can be seen that K = 2 is generally a good choice, in comparison to other options. On the other hand, we note that improperly introducing K smoothing vectors can hurt the assessment accuracy. It is also interesting to note that, when we shift the backbone from ResNet-18 to InceptionV2 or AlexNet, sometimes the K smoothing vectors output confidences larger than the two main classes. Further work is needed to investigate the underlying mechanism to exploit ASR better.

Additional experiments on AVA
To make the proposed ASR more convincing, we conducted additional comparative experiments on the large-scale AVA benchmark [2], which is widely used [3,15,16,40]. Without loss of generality, we experimented with ResNet-18 [35] models using common regularization strategies, following the common pipeline on the AVA dataset [3,15,16,40]: e.g., we used 5.0 as the threshold value for binary aesthetic assessment labels, 230k images for training and the remaining 20k for testing.
Fig. 9 Image aesthetic assessment accuracy versus K in Eq. (3) on the test partition of the GPD dataset.
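The binarization of AVA scores can be sketched as below. It assumes each image comes with a histogram of votes over the scores 1 to 10 and that a mean score strictly above the threshold counts as positive; the handling of a mean exactly equal to 5.0 is an assumption, and the names and vote counts are hypothetical.

```python
def mean_score(vote_counts):
    """Mean aesthetic score, where vote_counts[i] is the number of votes for score i + 1."""
    total = sum(vote_counts)
    return sum((i + 1) * count for i, count in enumerate(vote_counts)) / total


def binary_label(vote_counts, threshold=5.0):
    """Return 1 (aesthetically positive) if the mean score exceeds the threshold, else 0."""
    return 1 if mean_score(vote_counts) > threshold else 0


votes_high = [0, 0, 1, 2, 5, 10, 8, 3, 1, 0]   # toy histogram, mean about 6.17
votes_low = [2, 5, 10, 8, 3, 1, 0, 0, 0, 0]    # toy histogram, mean about 3.28

label_high = binary_label(votes_high)
label_low = binary_label(votes_low)
```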
The accuracy comparison is shown in Table 4. Our regularization strategy achieves results comparable to the state-of-the-art method [17] and outperforms other regularization methods. We do not claim superior results, but simply verify the effectiveness of our proposed ASR method, which works differently from existing methods [3,4,15,16,17]. Consequently, we further verify the effectiveness of the proposed regularized softmax in the task of image aesthetic assessment.
Table 4 Accuracy comparison on the AVA benchmark
Method                    Accuracy (%)
MP_ada [17]               83.0
ResNet-18                 81.8
ResNet-18 + aug           80.9
ResNet-18 + LSR [28]      82.5
ResNet-18 + σ_T [27]      82.3

Motivation
Another practical use for the ability to assess visual aesthetics of food images is to filter out bad instances (e.g., with observable artifacts) generated by CNN models. Currently, many researchers are working on image generation or enhancement [23,42,43]. However, the lack of effective methods to distinguish low-quality outputs impedes practical application of such methods.

Approach
With an aesthetic assessment model with good generalization ability, we gauged the aesthetic scores of the original images and the output ones, and then selected outputs with relatively high aesthetic scores or with only moderately degraded scores. This process relies on refinement based on evaluation rather than on manual annotation, and is akin to web-supervised learning [44] and evaluation without ground truth [45].
Intuitively, the generator model and the assessment model benefit each other in the long run.
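The selection rule can be sketched as a simple filter over score pairs. The thresholds, the function name, and the toy scores are hypothetical; the text only states that outputs with relatively high scores, or with moderately degraded scores compared to the original, are kept.

```python
def select_outputs(samples, high=0.6, max_drop=0.1):
    """Keep generated outputs that score high or lose at most max_drop.

    samples: list of (name, original_score, output_score) tuples, with
    scores taken from the aesthetic assessment model.
    """
    kept = []
    for name, original_score, output_score in samples:
        scores_high = output_score >= high
        degrades_moderately = (original_score - output_score) <= max_drop
        if scores_high or degrades_moderately:
            kept.append(name)
    return kept


samples = [
    ("add_corn", 0.70, 0.72),    # output scores high: kept
    ("remove_ham", 0.65, 0.58),  # small drop: kept
    ("add_olives", 0.55, 0.20),  # large drop and low score: discarded
]
kept = select_outputs(samples)
```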

Results
We conducted an experiment to assist food image generation with pizzaGAN [46], a generative adversarial network (GAN) based model that generates pizza images conditioned on a pizza photo and a cooking instruction (e.g., add corn, or remove ham). All the original images and the manipulated results can be found at http://pizzagan.csail.mit.edu/#.
The results of aesthetic assessment on the original food images and the manipulated ones are shown in Fig. 10. With the procedure described above, we can distinguish good results from low-quality ones without the need for a time-consuming user study.
In this way, we can discard improper output from generative models, making related applications on food images more practical.

Conclusions
To support research into food image aesthetic assessment, this work presents the GPD, the first complex, large-scale dataset of food images with corresponding binary aesthetic labels. To combat over-confidence, we have given a simple yet effective regularization strategy, ASR, which can improve the generalization ability of optimized CNN models. Extensive experiments with several typical machine learning approaches demonstrate that the proposed GPD can provide valuable help, enabling computer vision models to predict the visual aesthetics of food photos. Furthermore, the proposed regularization strategy is better than alternatives in helping CNN models to achieve generalization, on the GPD and the AVA. Even on unseen food photos, CNN models trained on the GPD and armed with the proposed ASR perform comparably with human experts in assessing visual aesthetics of food photos. All these empirical findings should encourage further research and practical applications related to aesthetic visual analysis of food images. For future work, we hope to expand the scale of the GPD and enrich its attributes such as viewing angle, layout, and scenes. We also hope to exploit the proposed dataset to further facilitate related applications in the specific domain of food images.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.