Joint regression and learning from pairwise rankings for personalized image aesthetic assessment

Recent image aesthetic assessment methods have achieved remarkable progress due to the emergence of deep convolutional neural networks (CNNs). However, these methods focus primarily on predicting the generally perceived preference for an image, which limits their practicability, since different users may have completely different preferences for the same image. To address this problem, this paper presents a novel approach for predicting personalized image aesthetics that fit an individual user’s personal taste. We achieve this in a coarse-to-fine manner, by joint regression and learning from pairwise rankings. Specifically, we first collect a small subset of personal images from a user and invite him/her to state a preference for some randomly sampled image pairs. We then search for the K-nearest neighbors of the personal images within a large-scale dataset labeled with average human aesthetic scores, and use these images and their associated scores to train a generic aesthetic assessment model by CNN-based regression. Next, we fine-tune the generic model to accommodate the personal preference by training on the rankings with a pairwise hinge loss. Experiments demonstrate that our method can effectively learn personalized image aesthetic preferences, clearly outperforming state-of-the-art methods. Moreover, we show that the learned personalized image aesthetics benefit a wide variety of applications.


Introduction
The explosive growth of digital images has spawned automatic image aesthetic assessment, which is an important research problem that benefits a wide variety of applications, including photo album management, automatic image enhancement, image retrieval, and media recommendation. Despite being studied for decades, this problem remains a challenge because of the inherent uncertainty and subjectivity. While recent learning-based methods have made remarkable progress by leveraging the advantage of CNNs in scene understanding and feature learning, they are mostly designed for learning a universal image aesthetic assessment model that represents the average preference. However, in most applications of image aesthetics, e.g., automatic image/video beautification and recommendation [1][2][3][4][5][6][7], the user's personal preference for an image is usually more desirable than average preference, since different users may have substantially different preferences for the same image, as demonstrated in Fig. 1.
Compared to generic or universal image aesthetic assessment, personalized image aesthetic assessment is a more challenging problem. Large-scale datasets (e.g., AVA dataset [8] and AADB dataset [9]) labeled with average human ratings or attributes already exist for generic aesthetic model training. In contrast, it is usually impractical to collect a large number of personal images labeled with the owner's visual preference, since not everyone maintains a large photo album, and rating image aesthetics could be tedious and unreliable for a single human agent.
Some research efforts have been made to tackle the personalized image aesthetic assessment problem. Ren et al. [10] proposed a residual-based model for accommodating individual aesthetic taste, while Park et al. [11] integrated personal preference into a generic aesthetic model by training a support vector machine (SVM) over pairwise ranking information. More recently, collaborative filtering [12,13] has been employed to assess personal aesthetic preference [14,15]. Despite the notable progress achieved by these methods, they still have limitations. Firstly, they usually collect absolute preference ratings of each personal image from the user, but we have found that such ratings are often unreliable since it is extremely difficult for a person to explicitly quantify his/her visual preference into discrete rating levels (e.g., the commonly adopted 1 to 10). Secondly, these methods may fail to effectively learn personalized aesthetic preferences from limited personal data.
In this paper, we present a novel personalized image aesthetic assessment method that is able to address these limitations of previous methods. Firstly, instead of directly collecting absolute aesthetic scores, we argue that it is more practical and reliable to collect relative preference rankings between images, since it is usually much easier and more reliable for a person to tell which one of two images he/she prefers, than to rate a single image with an absolute score. Thus, we ask the user to state his/her preference for a small number of image pairs randomly sampled from the collected personal images. Next, we enrich the pairwise ranking information by inferring new rankings from user-annotated rankings based on ranking transitivity, which largely remedies the lack of labeled data and allows us to more effectively learn personal preferences. Specifically, our approach comprises two stages. We first search for K-nearest neighbors of the collected personal images within a public aesthetic annotated benchmark dataset, and then train a generic aesthetic model from the searched images and the corresponding aesthetic scores using CNN-based regression. Finally, we adjust the generic model to fit the personal preference by learning from the pairwise rankings with a hinge loss.
In summary, the major contributions of this work are:
• a novel approach for learning personalized image aesthetics from very limited personal data, by joint regression and learning from rankings;
• extensive experiments to evaluate the proposed approach and compare it with various existing methods; results show that our method can more effectively learn personal aesthetic preferences;
• a demonstration that our learned personalized image aesthetics can be applied to customizing image retouching applications for a specific user, including exposure correction, color enhancement, and image dehazing.

Generic image aesthetic assessment
Most existing image aesthetic assessment methods aim to learn a generic aesthetic model based on the assumption that an implicit consensus exists about perceptually pleasant images. Early works treat image aesthetic prediction as a classification or regression problem of directly mapping hand-crafted visual features to aesthetic ratings provided by human raters [16][17][18]. With the emergence of large-scale aesthetic analysis datasets and deep neural networks, significant progress has been made towards automatic aesthetic assessment. Lu et al. [19] presented a multipatch aggregation network for aesthetic classification, which was then improved in Ref. [20] to incorporate a visual attention mechanism. Mai et al. [21] introduced a scene-aware network with adaptive spatial pooling to learn image aesthetics. Kong et al. [9] achieved photo aesthetics ranking by jointly learning image attribute and content information. Talebi and Milanfar [22] predicted the distribution of aesthetic scores using a convolutional neural network. Zeng et al. [23] presented a unified probabilistic formulation for image aesthetic assessment, while Zhang et al. [24] achieved unified aesthetic prediction through a gated peripheral-foveal convolutional neural network. More recently, Pan et al. [25] developed an image aesthetic assessment model assisted by attributes through adversarial learning. Wang et al. [26] devised a non-reference image quality assessment method for synthetic images based on convolutional neural networks and local image saliency. Sheng et al. [27] proposed the use of self-supervised feature learning for aesthetic prediction. See Ref. [28] for a survey of generic image aesthetic assessment.

Personalized image aesthetic assessment
Recently, there have been some research efforts towards personalized image aesthetic assessment. Ren et al. [10] achieved this goal by exploring the correlation between an individual user's preferences and generic aesthetic perception, while Park et al. [11] adopted ranking information between images to train an SVM to predict personal preferences. Another main line of research is to use collaborative filtering, a fundamental algorithm used by recommendation systems to produce personal recommendations, for personalized aesthetic prediction. Following this line, Wang et al. [14] devised a deep aesthetic assessment model that integrates collaborative and attentive learning, while Korhonen [15] predicted personally perceived image quality by combining classical image feature analysis and collaborative filtering. In contrast to methods built upon collaborative filtering, Li et al. [29] designed a personality-driven multi-task deep model for this purpose. Lee and Kim [30] used eigenvalue decomposition of a pairwise comparison matrix that involves multiple reference images and an input image. More recently, Zhu et al. [31] addressed the problem via meta-learning with bi-level gradient optimization, while Cui et al. [32] proposed to infer users' personal preferences based on their favoring behavior on social media platforms.

Learning to rank
Learning to rank has recently emerged as an attractive technique to train models for various vision and multimedia tasks. Yan et al. [33] trained a ranking model based on multiple additive regression trees for automatic image color enhancement. Paisitkriangkrai et al. [34] exploited learning to rank in person re-identification with metric ensembles. Liu et al. [35] used learning from rankings as a data augmentation technique for non-reference image quality assessment. Liu et al. [36] employed unlabeled data for crowd counting by learning to rank. In addition to the abovementioned application scenarios, learning to rank has also been applied to multi-label image classification [37,38].

Our approach
This section describes our personalized image aesthetic assessment approach. We first introduce how we collect pairwise preference rankings using personal images collected from a specific user. Next, we associate the collected personal images with AVA (currently the largest public aesthetic analysis dataset) and train a generic aesthetic model. Finally, we illustrate how we adjust the generic model with the pairwise rankings to accommodate personal taste, and consider implementation details. An overview of our approach is shown in Fig. 2.

Personal data collection
To collect the personal data, we first invited a user to share with us a small set of personal images. The user was asked to carefully select images with diverse contents, styles, lighting conditions, and colors. Then, the user was asked to provide a pairwise preference for some randomly sampled image pairs using the user interface shown in Fig. 3. Unlike previous methods, which require users to perform many pairwise rankings, we found that it is feasible to infer many useful rankings from user-annotated rankings with the Floyd-Warshall algorithm based on ranking transitivity. For instance, for three personal images $I_1$, $I_2$, and $I_3$, if the user-annotated rankings are $I_1 > I_2$ and $I_2 > I_3$, then we can generate $I_1 > I_3$ by transitivity. Note that each newly generated pairwise ranking is derived from only two user-annotated rankings, to avoid loops and to maintain the reliability of the generated rankings. In other words, we do not generate $I_1 > I_4$, even if we have $I_1 > I_2$, $I_2 > I_3$, and $I_3 > I_4$.

Fig. 2
Overview. Given a collection of a user's personal images, we first find each image's K-nearest neighbors (KNN) within the public AVA dataset. The discovered images and the corresponding aesthetic scores are then fed into a CNN-based regression network to train a generic aesthetic model. Next, we ask the user to state a preference for a few image pairs sampled from the personal image collection, and infer new rankings based on ranking transitivity. Finally, all pairwise ranking information is used to fine-tune the generic aesthetic model, turning it into a personalized aesthetic model that fits personal taste.
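
The transitivity rule described above can be illustrated with a small sketch. It is not the authors' implementation: instead of running the full Floyd-Warshall algorithm, it applies only the one-step case the text permits (combining exactly two user-annotated rankings). The function name `infer_rankings` and the `(preferred, other)` tuple encoding are assumptions made for illustration.

```python
from itertools import product

def infer_rankings(annotated):
    """Return rankings implied by exactly two user-annotated rankings.

    `annotated` holds (preferred, other) pairs. Only chains a > b and b > c
    are combined, so a > c is inferred but longer chains are not followed.
    """
    annotated = set(annotated)
    inferred = set()
    for (a, b), (c, d) in product(annotated, repeat=2):
        if b == c and a != d:
            candidate = (a, d)
            # Skip pairs the user already ranked, in either direction.
            if candidate not in annotated and (d, a) not in annotated:
                inferred.add(candidate)
    return inferred

user_rankings = {("I1", "I2"), ("I2", "I3"), ("I3", "I4")}
new_rankings = infer_rankings(user_rankings)
```

With the three annotated rankings in this example, the sketch yields `I1 > I3` and `I2 > I4` but not `I1 > I4`, which would need three annotated rankings. It does not check for contradictory annotations; a practical version would probably need to.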

Fig. 3
User interface for collecting pairwise rankings. For each image pair, the user was asked to select the image that he/she prefers. The preferred image is ranked higher.

Generic image aesthetic regression
With the collected personal images, we regress a generic aesthetic model that numerically describes the universal visual preference for images of similar categories. To this end, we first perform a KNN search for each personal image within the AVA dataset. The discovered images and the corresponding aesthetic scores are then employed to train a generic aesthetic model via CNN-based regression. Below we describe the above two components, KNN image searching and CNN-based aesthetic regression, in detail.
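
A minimal sketch of the KNN search step is given below, assuming each image has already been mapped to a feature vector and that similarity is measured by the cosine between vectors. Random vectors stand in for the real features here, and the helper name `knn_cosine` is hypothetical.

```python
import numpy as np

def knn_cosine(personal_feats, dataset_feats, k):
    """Indices of the k dataset rows most similar to each personal row."""
    # L2-normalize so that the dot product equals cosine similarity.
    p = personal_feats / np.linalg.norm(personal_feats, axis=1, keepdims=True)
    d = dataset_feats / np.linalg.norm(dataset_feats, axis=1, keepdims=True)
    sim = p @ d.T  # shape: (num_personal, num_dataset)
    # Sort by descending similarity and keep the first k columns.
    return np.argsort(-sim, axis=1)[:, :k]

rng = np.random.default_rng(0)
dataset = rng.normal(size=(1000, 64))                      # stand-in for dataset features
personal = dataset[:3] + 0.01 * rng.normal(size=(3, 64))   # near-copies of rows 0..2
neighbors = knn_cosine(personal, dataset, k=50)
```

The union of the returned indices over all personal images, together with the scores attached to those dataset images, would then form the regression training set. For a dataset of several hundred thousand images, an approximate nearest-neighbor index may be preferable to the dense similarity matrix used in this sketch.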

KNN image searching
To obtain the KNN, we first compute normalized feature vectors for each personal image and for the images in the AVA dataset, using VGG16 [39] pre-trained on ImageNet [40]. Next, we search the AVA dataset for the KNN of each personal image by measuring the cosine distance between the corresponding feature vectors. In our experiments, we empirically set K = 50, since this not only ensures that we collect sufficient training data for CNN-based aesthetic regression, but also allows more efficient network training.

CNN-based aesthetic regression
Figure 4 shows the overall architecture of our CNN-based aesthetic regression network. Specifically, VGG16 is utilized to extract feature maps; it consists of 16 layers: 13 convolutional layers with small convolution filters of size 3 × 3, and 3 fully connected layers. To allow input of images of arbitrary size and back-propagation from aesthetic scores to original pixels, we remove the last three fully connected layers of the original VGG16 and add a max-pooling layer. For a given image, we first extract three feature maps $Z_1$, $Z_2$, and $Z_3$ from VGG16, which are then fed into two convolutional attention modules to get the attention maps $A_1$ and $A_2$. Next, the predicted attention maps operate on the features in $Z_1$ and $Z_2$ via point-wise multiplication. This design is inspired by the physiological observation that local contexts typically play a more important role in visual preference evaluation at first glance. Finally, the attentive features are concatenated and fed into a fully connected layer with 10 neurons (annotated ratings are 1 to 10 for images in the AVA dataset) to predict the actual aesthetic score.

Fig. 4
Network architecture of our CNN-based aesthetic regression network. Given an input image, we first send it into VGG16 to get three feature maps, i.e., $Z_1$: 10th convolutional (conv) layer, $Z_2$: 13th conv layer, and $Z_3$: 13th conv layer with max-pooling. The feature maps ($Z_1$, $Z_2$, and $Z_3$) are then fed into two attention modules to get the attention maps $A_1$ and $A_2$, which are used to generate $Z'_1$ and $Z'_2$ by weighted combination with $Z_1$ and $Z_2$. Finally, we concatenate $Z'_1$ and $Z'_2$ to form the final feature representation, and employ a fully connected layer with 10 output neurons followed by the softmax function to predict the aesthetic score probabilities for the input image.

Now we describe the training loss for the CNN-based aesthetic regression network. Each image in the AVA dataset is assigned a set of user ratings ranging from 1 to 10, expressed as an empirical probability mass function $p = [p_1, \ldots, p_{10}]$ with $\sum_{i=1}^{10} p_i = 1$, where $p_i$, $i \in \{1, \ldots, 10\}$, denotes the probability that the image is labeled with aesthetic score $i$. Our goal is to predict the probability distribution of aesthetic scores for a given image. To regress a generic aesthetic model based on the discovered images and their associated rating annotations, we employ the Earth Mover's Distance (EMD) to formulate a loss for network training. It performs well due to its ability to penalize misclassifications according to class distance. Formally, the loss is defined as

$$L_{\mathrm{EMD}} = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{1}{10} \sum_{j=1}^{10} \left| C_{p^{(n)}}(j) - C_{\hat{p}^{(n)}}(j) \right|^{r} \right)^{1/r} \qquad (1)$$

where $N$ denotes the total number of discovered images, $C_p(j) = \sum_{i=1}^{j} p_i$ denotes the cumulative distribution function, and $\hat{p}$ denotes the probability mass function that we aim to estimate. $r$ is set to 2 to allow efficient optimization. Intuitively, the EMD-based loss measures the cost of moving the ground-truth distribution $p$ to the estimated distribution $\hat{p}$. The mean score obtained from the estimated distribution $\hat{p}$ is used as the output aesthetic score, i.e., $\sum_{i=1}^{10} i\,\hat{p}_i$.
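
To make the behavior of an EMD-style loss concrete, the sketch below computes it for batches of 10-bin score distributions. It assumes the per-image distance is the exponent-2 norm of the difference between cumulative distributions, averaged over the 10 bins and then over images, in the style of NIMA [42]. The function names and toy distributions are illustrative, and NumPy stands in for the training framework.

```python
import numpy as np

def emd_loss(p_true, p_pred, r=2):
    """Mean EMD between batches of score distributions of shape (N, 10)."""
    cdf_true = np.cumsum(p_true, axis=1)
    cdf_pred = np.cumsum(p_pred, axis=1)
    # Per-image distance: r-norm of the CDF difference, averaged over bins.
    per_image = np.mean(np.abs(cdf_true - cdf_pred) ** r, axis=1) ** (1.0 / r)
    return per_image.mean()

def mean_score(p):
    """Expected score sum_i i * p_i for scores 1..10."""
    return p @ np.arange(1, 11)

target = np.zeros((1, 10)); target[0, 6] = 1.0   # all mass on score 7
near = np.zeros((1, 10)); near[0, 7] = 1.0       # all mass on score 8
far = np.zeros((1, 10)); far[0, 0] = 1.0         # all mass on score 1

loss_near = emd_loss(target, near)
loss_far = emd_loss(target, far)
```

Because the loss compares cumulative distributions, predicting score 8 for a true score of 7 is penalized less than predicting score 1, which a plain cross-entropy loss would not distinguish.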

Personalized fine-tuning with pairwise rankings
Having obtained the generic aesthetic model, the next step is to incorporate personal visual preferences by fine-tuning the generic model according to the collected pairwise rankings. To do so, we retrain the CNN-based regression network on the rankings using a pairwise ranking hinge loss defined as

$$L(x_1, x_2; \theta) = \max\big(0,\; f(x_2; \theta) - f(x_1; \theta) + \varepsilon\big) \qquad (2)$$

where $x_1$ and $x_2$ are a pair of images, $\theta$ denotes the network parameters, and $f(x_1; \theta)$ and $f(x_2; \theta)$ represent the predicted aesthetic scores of images $x_1$ and $x_2$. $\varepsilon$ is the margin, which is set to 0.1 in our experiments. Following Ref. [35], we assume without loss of generality that $x_1$ is ranked higher than $x_2$, so the gradient of the loss in Eq. (2) can be written as

$$\nabla_\theta L = \begin{cases} 0, & f(x_2; \theta) - f(x_1; \theta) + \varepsilon \le 0 \\ \nabla_\theta f(x_2; \theta) - \nabla_\theta f(x_1; \theta), & \text{otherwise} \end{cases} \qquad (3)$$

The above equation implies that when the predicted scores of the network are in accordance with the pairwise ranking, the gradient is zero. When the pairwise ranking is not met, the gradient of the higher-ranked image ($x_1$) is subtracted and that of the other image ($x_2$) is added, so that a gradient descent step raises the predicted score of $x_1$ and lowers that of $x_2$.
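
The following sketch shows the hinge loss and its gradient with respect to the two predicted scores, assuming the standard margin-based form used in RankIQA [35]. The function `pairwise_hinge` is a hypothetical helper operating on scalar scores. In a real network, these per-score gradients would be propagated back to the parameters by the framework's automatic differentiation.

```python
def pairwise_hinge(score_hi, score_lo, margin=0.1):
    """Hinge loss for a pair whose first image is ranked higher.

    Returns (loss, d_loss/d_score_hi, d_loss/d_score_lo).
    """
    violation = score_lo - score_hi + margin
    if violation <= 0:
        # Ranking satisfied with at least the margin: no gradient.
        return 0.0, 0.0, 0.0
    # Ranking violated: descent raises score_hi and lowers score_lo.
    return violation, -1.0, 1.0

satisfied = pairwise_hinge(6.0, 5.0)   # prediction agrees with the ranking
violated = pairwise_hinge(5.0, 5.5)    # prediction contradicts the ranking
```

Note that a pair can agree with the ranking and still incur a loss if the score gap is smaller than the margin; this pushes the model towards clearly separated scores rather than near-ties.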

Implementation details
Our network was implemented in TensorFlow, and optimized by the Adam optimizer. For the CNN-based aesthetic regression, we trained for 10 epochs with a batch size of 32 and a fixed learning rate of $3 \times 10^{-7}$. For the ranking-based fine-tuning stage, we trained for another 20 epochs with an initial learning rate of $5 \times 10^{-6}$. An exponential decay of 0.5 was applied to the learning rate after every 500 iterations.
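
The fine-tuning schedule can be written as a small function. This sketch assumes a staircase decay, i.e., the rate is halved in discrete steps every 500 iterations rather than decayed continuously; the text does not say which variant was used. The function name is hypothetical and no TensorFlow API is involved.

```python
def learning_rate(step, base_lr=5e-6, decay=0.5, decay_steps=500):
    """Staircase exponential decay: multiply by `decay` every `decay_steps` iterations."""
    return base_lr * decay ** (step // decay_steps)

schedule = [learning_rate(s) for s in (0, 499, 500, 1000)]
```

Under this assumption, the rate stays at its initial value for the first 500 iterations and has dropped to a quarter of it by iteration 1000.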

Experiments
In this section, we describe experiments used to validate the effectiveness of the proposed approach. We first introduce the test datasets and evaluation metrics, and then compare our method to state-of-the-art methods. Next, we provide an in-depth analysis of our approach. Finally, we showcase several applications enabled by our approach.

Datasets
The benchmark AVA dataset [8] and the REAL-CUR dataset [10] were employed to evaluate our method.
The AVA dataset consists of 255,000 images, each of which is aesthetically rated by an average of 210 users with scores ranging from 1 to 10. The REAL-CUR dataset contains 14 personal photo albums (each including about 200 images), and each personal image is annotated with an aesthetic score ranging from 1 to 5. To unify the range of scores to [1,10], the annotated aesthetic scores of the REAL-CUR dataset were doubled. The REAL-CUR dataset serves two purposes. Firstly, it provides the personal images required for network training. Secondly, it can be used to verify the effectiveness of the learned personalized aesthetics. Specifically, we divided each album into two subsets: one consisting of X images for network training, and the other containing the remaining personal images for testing. Then, we found the KNN in the AVA dataset for each image in the training subset to construct the regression training dataset. Next, we randomly selected 100 image pairs from each training subset, and obtained their pairwise rankings according to the annotated aesthetic scores (pairs with equal scores were discarded). Finally, the obtained pairwise rankings were enriched with the Floyd-Warshall algorithm, and we trained a personalized image aesthetic assessment model based on the regression training dataset and the collected pairwise rankings.
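
As an illustration of how pairwise rankings can be derived from annotated scores, the sketch below samples image pairs, orders each pair by score, and discards ties. The function name, the toy album, and the fixed seed are assumptions for this example and are not taken from the experimental code.

```python
import random

def sample_ranked_pairs(scores, num_pairs, seed=0):
    """Sample image pairs and order each as (preferred, other) by score.

    Pairs with equal scores are discarded, so fewer than `num_pairs`
    distinct rankings may be returned.
    """
    rng = random.Random(seed)
    ids = sorted(scores)
    pairs = set()
    for _ in range(num_pairs):
        a, b = rng.sample(ids, 2)
        if scores[a] == scores[b]:
            continue  # no preference can be derived from a tie
        pairs.add((a, b) if scores[a] > scores[b] else (b, a))
    return pairs

# Toy album with scores in 1..5, doubled as described above.
raw = {"img0": 1, "img1": 3, "img2": 3, "img3": 5}
scores = {name: 2 * s for name, s in raw.items()}
pairs = sample_ranked_pairs(scores, num_pairs=100)
```

Doubling the scores does not change any pairwise ordering, so the rankings are the same whether they are derived before or after rescaling.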

Evaluation metrics
Akin to prior methods [9,31], ranking correlations are used to measure consistency between predictions and ground truth user scores. Specifically, we employed the Spearman rank-order correlation coefficient ($\rho$) [41] to quantitatively evaluate the performance of personalized image aesthetic assessment. It is defined as

$$\rho = 1 - \frac{6 \sum_{j=1}^{M} (r_j - \hat{r}_j)^2}{M (M^2 - 1)} \qquad (4)$$

where $r_j$ denotes the rank of the $j$th test image when sorting the ground truth aesthetic scores in descending order, while $\hat{r}_j$ denotes the rank given by the predicted aesthetic scores. $M$ is the number of test images. The value of $\rho$ ranges from −1 to 1, and a higher absolute value indicates stronger correlation and better overall performance.
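
For reference, the Spearman coefficient can be computed directly from the rank differences with the standard formula $\rho = 1 - 6\sum_j d_j^2 / (M(M^2-1))$, where $d_j$ is the difference between the two ranks of image $j$. The sketch below is a minimal version that assumes no tied scores; with ties, a library routine that assigns average ranks would be the safer choice. The function name is hypothetical.

```python
def spearman_rho(truth, pred):
    """Spearman rank correlation for two score lists without ties."""
    def ranks(values):
        # Rank 1 goes to the highest score (descending order).
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    m = len(truth)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(truth), ranks(pred)))
    return 1 - 6 * d2 / (m * (m * m - 1))

same_order = spearman_rho([9, 7, 5, 3], [8, 6, 4, 2])
reversed_order = spearman_rho([9, 7, 5, 3], [2, 4, 6, 8])
```

Only the ordering of the predictions matters here, not their absolute values, which suits the personalized setting, where the score scale may differ between users.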

Method
We compare our method with eight existing methods: NIMA [42], MPADA [20], MLSP [43], FPMF [44], and PAM [10], as well as three ranking-based methods: R-SVM [45], R-SVR [11], and RankIQA [35]. Note that the original RankIQA collects rankings by randomly distorting the input images for image quality assessment. To make it fit our task, we replaced its ranking data with our collected rankings. For fair comparison, we retrained the compared methods on the images discovered from the AVA dataset and the collected pairwise rankings, using the publicly available implementations provided by the authors with the recommended parameter settings. We implemented R-SVR ourselves since there is no publicly available implementation. We did not compare with Ref. [14], since it relies on both personal ratings and image reviews for model training. Our comparison is twofold. Table 1 reports a quantitative comparison of our method with the other methods, using 10 (X = 10) and 100 (X = 100) training images, respectively. The values shown in Table 1 are the mean ranking correlations over all 14 personal photo albums in REAL-CUR. As can be seen, directly learning the personal visual preference from very limited training data via naive regression (NIMA) or collaborative filtering (FPMF) results in poor generalizability to unseen test images. PAM produces very competitive results by simultaneously considering content and aesthetic attributes. Pairwise rankings are adopted in the three compared ranking-based methods (R-SVM, R-SVR, and RankIQA), yet our method outperforms them, showing that our method not only effectively learns personalized aesthetics from very limited data but also generalizes well to unseen personal images. Figure 5 compares personalized aesthetic scores predicted by our method and the comparative methods on some example test images from the REAL-CUR dataset. As can be seen, our personalized aesthetic assessment model more accurately predicts the user's ratings.

User study
As personalized aesthetic assessment is highly subjective, we further conducted a user study with 4 users (2 males and 2 females) to evaluate our method.
To this end, we first collected four personal image datasets from the users, covering a broad range of scenes, subjects, and lighting conditions. The four personal datasets are referred to as PD1, . . . , PD4, and each contains 200 images. We then randomly selected 150 personal images from each dataset and collected 220 pairwise rankings among these images from the corresponding user, while the remaining 50 personal images were reserved for testing. Next, we trained personalized aesthetic assessment models using our approach and the other three ranking-based methods (R-SVM, R-SVR, and RankIQA), and used the trained models to predict aesthetic scores for all testing images. To assess performance, we randomly selected image pairs from the test images and showed the corresponding user the personalized aesthetic scores predicted by different methods, and asked the user to judge whether the rankings indicated by the predicted scores were consistent with his/her personalized visual preference. Table 2 summarizes the percentage of pairwise rankings predicted by different methods that are consistent with the specific personalized user preference. We can see that our predicted aesthetic scores better match the user's preference. Figure 6 shows some example results for image pairs employed in the user study.

Ablation study
We also quantitatively evaluated the effectiveness of the CNN-based regression, the learning from ranking design, and the attention mechanism in the regression network. Comparing the 2nd row and 13th row in Table 1, we observe a clear advantage of our regression network over a baseline regression network (NIMA). Moreover, in addition to the pairwise ranking hinge loss that we use, we also tried two other commonly used alternatives, exponential loss and logistic loss [46]. As shown, using the same regression network, the hinge loss achieved better results than the other two losses, demonstrating its effectiveness. As can be observed by comparing the 12th and 14th rows with the 15th row, omitting the attention module from the regression network or the pre-training on AVA leads to an obvious decrease in overall performance, demonstrating that both are beneficial to learning personalized aesthetics.

Fig. 5
Comparison of personalized aesthetic scores S predicted by our method and state-of-the-art methods, for some test images from the REAL-CUR dataset. Above: test images. Below: personalized aesthetic scores predicted by different methods and ground truth scores given by the image owner. The predicted aesthetic score closest to the user-labeled score is highlighted by a gray background.

Limitations
Our method may fail to accurately predict personal preferences when the collected personal images are severely imbalanced in subjects and scenes. For instance, when most personal images belong to a single category (e.g., indoor images), our method may fail to predict the individual's preferences for other kinds of images (e.g., portraits).

Applications
Our approach can be applied to personalized image retouching to better meet users' personalized tastes.
To do so, we designed an aesthetic quality loss, where x and f(x) denote the retouched image and its predicted personalized aesthetic score, respectively (f denotes our trained personalized aesthetic assessment model). Intuitively, this loss forces the score of the retouched image to be as close to the maximum (10) as possible. By incorporating this loss into the training of a learning-based image retouching framework, we can achieve personalized image retouching. Figures 7-9 show the use of our learned personalized aesthetics for a user who favors bright scenes, vivid colors, and clear details in the image retouching tasks of exposure correction, color enhancement, and image dehazing. As shown, incorporating personalized aesthetics produces results which better satisfy the user's preferences.

Conclusions
We have presented a novel approach for personalized image aesthetic assessment. Unlike previous methods that are mostly based on user-annotated absolute aesthetic ratings, we distill an individual user's visual preference by joint regression and learning from pairwise rankings, which not only allows more accurate aesthetic learning, but also remedies the lack of labeled data. We first collect a small set of personal images and find their K-nearest neighbors in the benchmark AVA dataset, and then train a generic aesthetic model with the discovered aesthetically labeled images. Next, we adjust the generic model to accommodate personal taste by incorporating user-annotated ranking information. Experiments demonstrate the effectiveness of our method.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.