SN Computer Science 1:58

A Deeper Look at Human Visual Perception of Images

  • Shaojing Fan
  • Bryan L. Koenig
  • Qi Zhao
  • Mohan S. Kankanhalli
Original Research

Abstract

How would one describe an image? Interesting? Pleasant? Aesthetic? A number of studies have classified images with respect to such attributes. A common approach is to link lower-level image features with higher-level properties and train a computational model to perform classification using human-annotated ground truth. Although these studies produce algorithms with reasonable prediction performance, they provide few insights into why and how the algorithms work. The current study focuses on how multiple visual factors affect human perception of digital images. We extend an existing dataset with quantitative measures of human perception of 31 image attributes under 6 different viewing conditions: images that are intact, inverted, grayscale, inverted and grayscale, and images showing mainly low- or high-spatial frequency information. Statistical analyses indicate varying importance of holistic cues, color information, semantics, and saliency for different types of attributes. Building on these insights, we construct an empirical model of human image perception. Motivated by the empirical model, we design computational models that predict high-level image attributes. Extensive experiments demonstrate that understanding human visual perception helps create better computational models.

Keywords

Visual sentiment · Empirical modeling · Computational modeling

Introduction

Automated assessment of high-level image properties has many commercial applications, such as monitoring social media, facilitating targeted advertising, and understanding user behavior. Analyses of visual aesthetics have been used commercially, for example to assess the quality of handwriting [1, 2, 3], and predicting media interestingness has been applied to assess educational and advertising effectiveness [4, 5]. A plethora of studies have focused on predicting high-level image attributes, such as aesthetics [6, 7, 8], sentiment [9, 10], and memorability [12, 13]. However, existing research on computational visual perception is rather insular, and a theoretical gap separates high-level image properties from low-level computational features. Although convolutional neural networks are becoming more common in this field [14, 15], such research has provided few insights into why the learned features represent subjective human perceptions. Indeed, human visual perception is subjective, implicit, and multi-dimensional, which is why it is difficult to gain a deep understanding of it using standard models and approaches from computer vision.

The current study empirically identifies factors underlying human perception of digital images and incorporates them into computational models. Our research builds on findings in multiple disciplines. Research on parallel processing modules in the human visual system suggests that both piecemeal (dealing with local features) and holistic (utilizing global features) processes are important for object recognition [16, 17, 18]. Research on the multiple-channel notion suggests that visual cortex cells act as two-dimensional spatial filters tuned to different spatial frequencies [19, 20, 21, 22].

Motivated by these human perceptual characteristics, we define local image features as features originating from visual saliency (e.g., the entropy of an image’s visual saliency map), global image features as features that provide a holistic description of an image (e.g., the mean value of the H, S, V channels of HSV color space), and spatial frequency (SF) features as statistics of the coefficients of a 2-D discrete wavelet transform of the image. To analyze the effects of these different features, we present a study of human image perception that leverages human psychophysics, inferential statistics, and computer vision (see Fig. 1 for an overview). More specifically, we first extend an existing dataset for human visual perception analysis. The enhanced dataset includes 400 images spanning a wide range of semantics and types (photographs, computer graphics, matte paintings) in 6 different versions (see Fig. 2), together with human annotations of an extensive list of attributes under the 6 conditions. Through a series of human behavioral studies and statistical analyses on the extended dataset, we show that color, local and global information, and spatial frequency each relate to distinct sets of high-level attributes during human perception. Our analyses provide insights into the processes underlying human perception of images (see Fig. 1a, b), on which we build an empirical model. Based on the empirical model, we propose a computational model that effectively integrates human visual characteristics to predict high-level human perceptions and reactions (see Fig. 1c).
Fig. 1

a We first perform a comprehensive human psychophysics experiment building on prior findings on the human visual system. b We then conduct detailed statistical analyses on the correlation between local, global, and spatial frequency information and 31 high-level image attributes. c Finally, we propose a computational model that takes into account our key psychophysics insights and predicts multiple high-level image attributes

Fig. 2

Sample images of six viewing conditions in our human studies. LSF and HSF represent low spatial frequency and high spatial frequency, respectively

Our findings could be of interest to multiple disciplines. First, our study represents one of the first attempts to apply classic experimental paradigms from psychology in the computer science domain. There is a rich literature in psychology and neuroscience on analyzing human behavioral characteristics, and our work demonstrates how to utilize this multi-disciplinary knowledge in computer vision algorithm design. Second, our human findings can help mass media practitioners and artists design emotion-eliciting visual content; they provide insights, for example, into choosing between grayscale and colored stimuli [23, 24] for advertisement design, and into redacting emotion-eliciting information in privacy-preserving applications. Finally, our computational modeling results suggest that a deep understanding of human visual characteristics helps create better computational models, providing insights into how knowledge about human visual processing can be incorporated into computer vision algorithms to boost performance.

Related Work

Predicting High-Level Image Attributes

Research on image assessment has largely focused on predicting high-level image attributes, such as image memorability [13, 25], aesthetics [7, 8, 26, 27], interestingness [28], popularity [29], visual realism [30, 31], and sentiment [9, 10, 32, 33]. Towards the goal of improved prediction, these studies use lists of image features known to influence the specific perception of an image. However, few insights have been provided into why those features predict those perceptions. Recently, deep neural networks (DNNs) have been increasingly used and have demonstrated superiority in predicting memorability [12, 34], visual realism [14], and visual sentiment [15, 35]. But training a DNN requires a large-scale dataset with human ground-truth annotations, and collecting human data at such scale is often expensive and time consuming. In our work, we explore high-level image perception from a psychophysics perspective and incorporate our findings to build a computational model for the prediction of high-level image attributes. The proposed computational model is novel in the sense that it mimics the characteristics of human visual perception.

Characteristics of Human Visual Perception

In the field of human vision, the holistic–piecemeal theory has been studied for a century, beginning with the debate between Structuralists, who championed the role of elements, and Gestalt psychologists, who argued that the whole is different from the sum of its parts [36, 37, 38, 39]. In face perception studies, most researchers agree that humans use two parallel pathways, holistic and piecemeal, in contexts such as face identification and expression recognition [16, 17, 40]. The holistic–piecemeal theory was later extended to the perception of general objects as well as scenes [18, 41, 42, 43, 44]. Other researchers have focused on the contribution of different spatial frequency channels to visual perception and found that various visual cortex cells are tuned to different spatial frequencies [19, 45, 46, 47, 48]. Moreover, viewed on a timescale, human visual perception is commonly regarded as a coarse-to-fine process [49, 50]. Our psychophysics experiment takes these findings from psychology and neuroscience into account. We make one of the first attempts to apply classic experimental paradigms from visual psychophysics in computer vision research, to explore quantitatively how these underlying processes influence human perception of high-level image attributes.

A few studies have tried to incorporate human visual characteristics into computational learning, for example by training neural networks for object recognition using blurred images [51] and by predicting human attention in low-resolution or noisy images [52, 53]. In our work, we explore human visual characteristics for image perception more deeply and computationally model these characteristics for a wider set of high-level image attribute predictions.

Attention and Emotion

Humans are known to attend selectively to visual stimuli [54, 55]. Selective attention is the process of focusing on a particular object in the environment for a certain period of time; because attention is a limited resource, it allows us to tune out unimportant details and focus on what matters. Previous studies [56, 57, 58] have explored the relation between human attention and visual sentiment and found that humans attend to emotional objects more than to emotionally neutral objects. A few computational models have used visual saliency information in predicting image attributes, such as visual aesthetics [8, 59], memorability [60, 61, 62], and sentiment [11]; a common approach is to add a sub-network that computes image saliency to a DNN model. Our work is distinct from these studies in that we focus on the behavioral question of how visual saliency influences human image perception and systematically investigate how saliency features contribute to predicting multiple high-level image attributes.

Proposed Method

In this section, we introduce the methodology used for our study. We describe how we construct our dataset and collect human data.

Dataset Construction and Stimulus Manipulation

We propose an enhanced image dataset, in which each image has quantitative annotations of human perception of 31 high-level image attributes under different conditions. Human annotations are collected in a psychophysics experiment on Amazon Mechanical Turk (MTurk) [63]. We extend an existing dataset proposed in our previous work [64]. We choose this dataset as it is a very comprehensive image set, with (1) 31 human-annotated attributes on each original image, (2) 31 human-annotated attributes on the focal object (i.e., the object with the highest saliency score in an automatically computed saliency map [60]) of each original image, and (3) extensive object labels annotated by LabelMe [61]. The 31 attributes represent a comprehensive set of human perceptions of visual content, which include image aesthetics, semantics, spatial layout, and image sentiments (see Table 1).
Table 1

List of 31 human-annotated attributes in our human studies

Attribute type | Detailed attributes
Aesthetics | Aesthetic? High quality vs. low quality; sharp vs. blurry; expert photography? Attractive to you? Colorful? Harmonious color?
Semantics | Contain fine details? Dynamic or energetic scene? Have storyline? Object appearance natural? Naturally occurring object combinations? Familiar with the scene? People present?
Spatial layout | Clean scene? Close range vs. distant view; have objects of focus? Single focused? Neat space? Empty space vs. full space? Common perspective? Centered?
Naturalness | Lighting effect natural? Natural color? Appears to be a photograph rather than computer generated?
Emotion | Makes you happy? Exciting? Interesting? Makes you sad? Unusual or strange? Mysterious?

To evaluate the importance of multiple image factors, we manipulate the images using the following methods that are commonly applied in psychology, neuroscience, or computer vision [13, 62, 63, 64]: (1) decoloring—removing color to leave only intensity information; (2) inversion—disrupting global/configural information (local information may also be disrupted, but not as much as global information); (3) both decoloring and inversion; (4) Fourier image decomposition to extract low- and high-spatial frequency information.

For (4), we follow the same method as [64] to obtain LSF and HSF images. Original images are Fourier transformed and multiplied by low-pass and high-pass Gaussian filters to preserve LSF (below 8 cycles/image) and HSF (above 24 cycles/image) information, respectively. The product is then inverse-Fourier transformed and the values are rescaled to the full 8-bit range. Rather than splitting the spectrum at a single cut-off frequency (e.g., 16 cycles/image), we maximize the difference between the LSF and HSF conditions: in line with previous work contrasting categorization performance for LSF and HSF stimuli (e.g., [63, 65]), the two conditions exclude intermediate spatial frequencies (here 8–24 cycles/image) [66, 67, 68, 69, 70].

In total, each image in our stimulus set has 6 versions: color_upright (original), color_inverted, grayscale_upright, grayscale_inverted, showing mainly low-spatial frequency (LSF) information, and showing mainly high-spatial frequency (HSF) information (see Fig. 2 for examples).
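
For concreteness, the sketch below shows how the six stimulus versions could be generated with NumPy and Pillow. The Gaussian filter shape, the 180-degree rotation used for inversion, and the 256 × 256 resizing are illustrative assumptions rather than the exact parameters of the original stimulus pipeline.

```python
import numpy as np
from PIL import Image

def gaussian_filter_fft(size, cutoff, highpass=False):
    """2-D Gaussian filter in the frequency domain (cutoff in cycles/image)."""
    fy = np.fft.fftfreq(size)[:, None] * size   # cycles/image along y
    fx = np.fft.fftfreq(size)[None, :] * size   # cycles/image along x
    radius2 = fx ** 2 + fy ** 2
    g = np.exp(-radius2 / (2.0 * cutoff ** 2))
    return 1.0 - g if highpass else g

def filter_channel(channel, cutoff, highpass=False):
    """Fourier transform, multiply by the Gaussian filter, inverse transform (square images assumed)."""
    f = np.fft.fft2(channel)
    f *= gaussian_filter_fft(channel.shape[0], cutoff, highpass)
    out = np.real(np.fft.ifft2(f))
    # Rescale to the full 8-bit range, as described in the text.
    out = (out - out.min()) / (out.max() - out.min() + 1e-8)
    return (out * 255).astype(np.uint8)

def make_versions(path, lsf_cutoff=8, hsf_cutoff=24):
    img = Image.open(path).convert('RGB').resize((256, 256))
    arr = np.asarray(img)
    gray = np.asarray(img.convert('L'))
    return {
        'color_upright': arr,
        'color_inverted': arr[::-1, ::-1],           # 180-degree rotation in the picture plane
        'grayscale_upright': gray,
        'grayscale_inverted': gray[::-1, ::-1],
        'LSF': np.stack([filter_channel(arr[..., c].astype(float), lsf_cutoff)
                         for c in range(3)], axis=-1),
        'HSF': np.stack([filter_channel(arr[..., c].astype(float), hsf_cutoff, highpass=True)
                         for c in range(3)], axis=-1),
    }
```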

Human Psychophysics Study

We presented workers on MTurk with a series of image annotation tasks. Each participant saw only one version of the images (e.g., original or color_inverted). Figure 3 shows a screenshot of the user interface for the color_upright (original) viewing condition. The interfaces were the same for the other viewing conditions except that the images on the left were modified as appropriate for each condition. Participants annotated the same attribute list as in [26] (see Table 1). Each image was rated by 9 MTurk workers on all of the attributes (see Supplementary Material for the detailed questionnaire). The average response across the 9 ratings for each attribute was normalized between 0 and 1 and stored as the attribute value for the image. The mean of each human-annotated attribute for all six versions is reported in the Supplementary Material. The data with all annotations are available online at the project homepage https://ncript.comp.nus.edu.sg/site/ncript-top/sentiment/.
Fig. 3

The user interface for the image attribute annotation task shown to workers on Amazon Mechanical Turk. In this case, the image is presented in color_upright viewing condition

Measures for Ensuring and Evaluating Data Reliability

We used three measures to ensure data quality. First, we invited only participants with an approval rate higher than 95% in MTurk’s system. Second, we inserted two questions in the questionnaire to check whether participants were clicking buttons randomly: “Are you serious in doing this survey?” and “Are you providing answers randomly?” Third, we filtered out submissions that were completed too quickly or too slowly (trimming the top and bottom 5% based on task completion time). Previous work has shown that filtering based on completion speed removes low-quality submissions [71, 72].
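
The completion-time filter amounts to a simple quantile trim; a minimal pandas sketch is given below, assuming a hypothetical submissions table with a completion_seconds column.

```python
import pandas as pd

def trim_by_completion_time(submissions: pd.DataFrame, frac: float = 0.05) -> pd.DataFrame:
    """Drop the fastest and slowest `frac` of submissions by completion time."""
    lo = submissions['completion_seconds'].quantile(frac)
    hi = submissions['completion_seconds'].quantile(1 - frac)
    return submissions[submissions['completion_seconds'].between(lo, hi)]
```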

To evaluate data reliability, we performed two experiments to assess within- and across-group consistency in human annotation. First, we used bootstrapping to randomly form two subject groups: we randomly selected nine data points (sampling with replacement) from all the annotations of each image to form an observation for one participant group, and repeated this to create a second group. We quantified the degree to which the attribute scores of the two participant groups agreed using Spearman’s rank correlation (ρ) and computed the average ρ over 25 bootstrapping iterations (referred to as within-group correlation). Overall there is moderate consistency among all attributes across all viewing conditions. We further computed the correlation of our annotations with the annotations published in [64] (referred to as across-group correlation). Although the two sets of annotations were collected at different times with different experimental settings, there is still a statistically significant correlation, indicating that humans are moderately consistent in image perception (see Fig. 4).
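
The within-group consistency check can be sketched as follows, assuming ratings maps each image to its nine per-worker scores for one attribute; this illustrates the bootstrapping procedure rather than reproducing the exact analysis code.

```python
import numpy as np
from scipy.stats import spearmanr

def within_group_consistency(ratings, n_iters=25, n_raters=9, seed=0):
    """Bootstrap two pseudo-groups per image and correlate their mean scores."""
    rng = np.random.default_rng(seed)
    images = sorted(ratings)
    rhos = []
    for _ in range(n_iters):
        g1, g2 = [], []
        for img in images:
            scores = np.asarray(ratings[img])
            g1.append(rng.choice(scores, n_raters, replace=True).mean())
            g2.append(rng.choice(scores, n_raters, replace=True).mean())
        rho, _ = spearmanr(g1, g2)
        rhos.append(rho)
    return float(np.mean(rhos))
```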
Fig. 4

Within-group a and across-group b human annotation consistency for the six conditions

To conclude, similar to other subjective image properties such as memorability, visual sentiment is likely influenced by user context. Despite this expected variability when evaluating subjective properties of images, we observe a sufficiently large degree of consistency between different users’ judgments, suggesting that it is possible to devise automatic systems that estimate visual sentiment directly from images while ignoring user differences. One direction for future work is to design personalized approaches that predict both human sentiment and individual differences [73, 74].

Empirical Modeling

In this section, we report results of the statistical analyses on the human data, highlighting important observations, and then model our human findings empirically.

Methods and Definitions

Multi-level modeling (MLM), a statistical modeling method commonly used in psychology and the social sciences, was used to analyze the human data described in the previous section. MLM is especially useful for multi-level (also called nested) data, in which observations are nested within levels of another variable, giving the data a hierarchical structure [75]. Readers can refer to the Supplementary Material for a detailed introduction to MLM. In our study, the score of each image attribute is nested at two levels (lower level: individual human rating; higher level: individual image) and influenced by two factors (factor A: viewing condition; factor B: image content). Our modeling considered all of these aspects and their interactions. We performed a series of mixed-model ANOVAs, followed by post hoc Tukey tests with Bonferroni correction. Readers can refer to [76] for an overview of these inferential statistical methods.
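
As a rough illustration of this two-level structure (ratings nested within images, with viewing condition as a fixed effect), the sketch below fits a random-intercept model with statsmodels on synthetic data; it approximates, rather than reproduces, the mixed-model ANOVAs reported here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format data standing in for the real annotations:
# one row per (image, rater), with the viewing condition and the rating.
rng = np.random.default_rng(0)
rows = []
for image_id in range(40):
    base = rng.uniform(0, 1)                      # image-level effect (higher level)
    for cond in ["color_upright", "grayscale_upright", "color_inverted"]:
        for _ in range(9):                        # nine raters (lower level)
            shift = 0.1 if cond == "color_upright" else 0.0
            rows.append({"image_id": image_id,
                         "condition": cond,
                         "score": float(np.clip(base + shift + rng.normal(0, 0.1), 0, 1))})
df = pd.DataFrame(rows)

# Random intercept per image; viewing condition as a fixed effect on the ratings.
model = smf.mixedlm("score ~ C(condition)", data=df, groups=df["image_id"])
print(model.fit().summary())
```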

We use different types of features to aid our human data analysis. Again, we define local image features as features originating from visual saliency (e.g., the entropy of an image’s visual saliency map), global image features as features that provide a holistic description of an image (e.g., the mean value of the H, S, V channels of HSV color space), and spatial frequency (SF) features as the energy after a 2-D discrete wavelet transform of the image.

Results of Human Psychophysics Studies

We analyze the impact of the six viewing conditions (color_upright, grayscale_upright, color_inverted, grayscale_inverted, and showing mainly LSF or HSF information). We first report key observations, followed by discussion and comparison with the existing literature. For all results mentioned, detailed statistics are reported in the Supplementary Material.

First, we found that both viewing condition and image content significantly affected all attributes (ps ≤ 0.001), indicating that image semantics, inversion, decoloring, and spatial frequency filtering all influenced human perceptions.

Second, color increased positive visual sentiments: decoloring of upright images significantly reduced the scores of the emotional attributes “exciting”, “make happy”, “attractive”, and “interesting” (see Fig. 5). This accords with previous findings in psychology [77, 78] showing that color is important for the perception of positive sentiment. Notably, color does not significantly affect the attribute “aesthetic”, which implies that the perception of image aesthetics does not depend largely on color. In fact, previous literature has shown that image aesthetics are determined by multiple factors such as image layout (e.g., the “rule of thirds”), image contrast, and image semantics [7, 79, 80].
Fig. 5

Comparison of human ratings on positive sentiments among different viewing conditions. Error bars represent standard error of the means. The asterisks are denoted as follows: *P < 0.005, **P < 0.001, ***P < 0.0005. For clarity, only selected comparison results are denoted with asterisks

Our results further indicate that holistic information, especially holistic color information, increases positive visual sentiments. Inversion significantly lowered the scores of “exciting”, “aesthetic”, “interesting”, and “high quality” for colored images, but not for grayscale images (see Fig. 5). As inversion is known to impair holistic perception [81, 82], this suggests that holistic color information increases positive sentiments.

Furthermore, HSF arouses more positive sentiment than LSF. Compared to LSF images, HSF images are rated higher on positive attributes such as “exciting”, “attractive”, “aesthetic”, and “make happy” (see Fig. 5), indicating the importance of high spatial frequencies for positive sentiments. LSF images have ratings on “strange” and “mysterious” similar to those of the original images (ps > 0.1). This complements findings in [83], which show higher energy (i.e., the sum of the mean squared wavelet coefficients in the horizontal, vertical, and diagonal sub-bands) in the LSF band for emotional pictures than for neutral pictures. Here we found that HSF carries more information than LSF for positive sentiments, whereas negative information is mainly carried by LSF.

Our MLM analyses also indicate that semantics influenced all attributes: object type significantly impacted visual perception. Interestingly, natural objects aroused as much excitement as faces (see Supplementary Material for detailed analyses).

Previous work indicates that visual saliency influences human emotion perception [57, 58, 64, 84, 85]. To better understand how visual saliency affects image perception, we extracted features from the focal object in each image and analyzed their correlation with the visual attributes. Simple low-level image features are traditional predictors of visual saliency [86]. We computed the mean and standard deviation (STD) of the HSV color channels of the focal object and, for comparison, the same color statistics on the whole image, and then calculated the Spearman’s rank correlation between these color features and the attribute scores. The overall impact of basic color statistics is low, but the correlations for the focal-object color features are higher than those for the global features on “attractive”, “colorful”, “make happy”, “make sad”, “close view”, and “neat space” (see Fig. 8), suggesting that even with only low-level color features, the perception of these attributes is still largely carried by the focal object.
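
A sketch of this comparison is shown below, assuming hypothetical inputs: RGB uint8 images, boolean focal-object masks, and per-image attribute scores. The aggregation over the six color statistics is illustrative; the text does not specify how individual statistics were combined.

```python
import numpy as np
import cv2
from scipy.stats import spearmanr

def hsv_stats(rgb, mask=None):
    """Mean and STD of the H, S, V channels, optionally restricted to a mask."""
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV).reshape(-1, 3).astype(float)
    if mask is not None:
        hsv = hsv[mask.reshape(-1)]
    return np.concatenate([hsv.mean(axis=0), hsv.std(axis=0)])   # 6-D

def color_vs_attribute(images, focal_masks, attribute_scores):
    focal = np.array([hsv_stats(im, m) for im, m in zip(images, focal_masks)])
    whole = np.array([hsv_stats(im) for im in images])
    # Correlate each color statistic with the attribute and keep the strongest one.
    rho_focal = max(abs(spearmanr(focal[:, k], attribute_scores)[0]) for k in range(6))
    rho_whole = max(abs(spearmanr(whole[:, k], attribute_scores)[0]) for k in range(6))
    return rho_focal, rho_whole
```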

Interestingly, in terms of human consistency, inversion (of color images) produced the highest consistency in human perception (see Fig. 4). When a color image is inverted, the cognitive recognition process is disrupted: people can no longer match the inverted image with prior (upright) templates [18, 87]. Their judgments therefore rely more on low-level features (e.g., color and edges) and, we believe, depend less on individual experience. If so, the higher consistency in human annotations results from the removal of idiosyncratic reactions. However, when an image is both inverted and decolored, people have fewer low-level features to use (no low-level color features); thus, human consistency in this condition was lower than for color_inverted.

Unified Interpretation of Human Sentiment Perception

Here, we propose a model for the process of human visual sentiment perception that is consistent with our major findings. The model is an integrative hypothesis about the modes of processing of image features relevant to visual sentiment. The chief purpose of the model is to provide readers a unified framework for comprehending the relatively large set of results from our psychophysics experiments.

As such, we constructed our model to be consistent with our major experimental results. Therefore, the model is empirical, in the sense that it is based (qualitatively) on data. It was designed to rely on simple and biologically plausible computations (e.g., template matching).

The model, shown in Fig. 6, is broadly divided into “piecemeal” and “holistic” pathways. Both piecemeal and holistic pathways perform essentially the same operation: matching a given image patch to templates that have been learned through experience with other objects. The key difference between the pathways lies in the templates. Piecemeal templates are detailed and small in size, while holistic templates are coarse and cover multiple image parts. The template-matching outputs, which range between 0 and 1, are then combined in a weighted manner. The weights (denoted Wi in Fig. 6) can be positive or negative, and are learned through prior experience. The final model output (i.e., the weighted combination) indicates the level of a sentiment.
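
The following toy sketch illustrates the weighted template-matching combination in Fig. 6; the template sets, matching scores, and weights are all hypothetical and purely illustrative.

```python
import numpy as np

def template_match(patch, templates):
    """Best normalized correlation (clipped to 0..1) between a patch and a template set."""
    scores = []
    for t in templates:                      # patch and templates assumed same shape
        a = (patch - patch.mean()) / (patch.std() + 1e-8)
        b = (t - t.mean()) / (t.std() + 1e-8)
        scores.append(float(np.clip((a * b).mean(), 0.0, 1.0)))
    return max(scores)

def sentiment_score(piecemeal_patches, holistic_view,
                    piecemeal_templates, holistic_templates, weights):
    """Weighted combination of piecemeal and holistic template-matching outputs."""
    outputs = [template_match(p, piecemeal_templates) for p in piecemeal_patches]
    outputs.append(template_match(holistic_view, holistic_templates))
    # weights (one per output, positive or negative) stand in for prior experience
    return float(np.dot(weights, outputs))
```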
Fig. 6

Perceptual model of human visual sentiment (taking the perception of “attractiveness” as an example). Human perception of digital images goes through both piecemeal and holistic pathways, and is based on a template matching mechanism. The outputs from various pieces of information are combined in the weighted manner to elicit different levels of visual sentiments

Because the majority of templates are derived from prior experience of interacting with upright scenes, when images are inverted the template-matching outputs become smaller, which causes the magnitude of the final ratings to decrease. This results in systematically smaller differences among human ratings (i.e., higher consistency in sentiment perception). Colored templates are affected more by inversion and grayscale templates less (see [88] for a review of the relationship between holism and inversion). Thus, inversion significantly increases human consistency for color images, but not for grayscale images.

As shown in the model, colored features in both piecemeal and holistic pathways contribute predominantly to positive sentiments. This accounts for the finding that color and HSF information increased positive ratings.

When image resolution is reduced (e.g., in the LSF viewing condition), fine details of the image are lost. This affects piecemeal processing severely, so the contribution of the piecemeal pathway to the weighted summation is reduced, ultimately leading to lower ratings of positive sentiments. This is consistent with the findings that semantics and visual saliency significantly affect visual sentiment and that HSF arouses more positive sentiment than LSF.

The model is able to qualitatively account for all of the main effects found in the human psychophysics experiments, but it has a number of limitations. For example, the details of the weighted combination are not specified. Should the combination be more OR-like (additive) or more AND-like (multiplicative)? Is the combination the same for all types of sentiments? Is there a single combination step, or a hierarchy of OR-like and AND-like combinations [87]? Some previous studies [64, 89] suggest that, at least for the piecemeal templates, the outputs may be combined in an AND-like fashion: as long as even a single part is deemed negative, the whole image is perceived as negative. Furthermore, the model does not specify how the template matching is performed: does the template matching for semantics occur before sentiment perception, or do the two happen simultaneously?

The model also makes various simplifying assumptions. For example, in practice, it is possible—even likely—that there is a continuum between piecemeal and holistic processing. The relative contributions of piecemeal and holistic processing are likely to be influenced by many factors such as context, priming effects, and so on.

Our model structure, built on images of general scenes, is tantalizingly reminiscent of psychological models of human face perception [7], suggestive of common processing mechanisms in visual perception [18, 90, 91, 92]. Although our model is currently qualitative, it lays the groundwork for follow-up studies to specify parameter values for a quantitative computational model; models of cognitive processes that start as qualitative often inspire later neural computational models [93]. In the following sections, we try to validate our perceptual model computationally. We also plan to flesh out the model through fMRI experiments in the future.

Computational Modeling

Here we design computational features motivated by our human studies, and build a computational model for quantitative sentiment assessment based on these features.

Feature Design

Global Features

We have shown in the previous sections that human image perception is influenced by both local and global information. We therefore include a suite of global image descriptors that have previously been found effective for high-level image attribute prediction [9, 10, 31, 94] and scene recognition [95]. The global features we use are (i) the mean and STD of the HSV color channels, (ii) GIST [75], and (iii) HOG 2 × 2 [96, 97, 98].
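
A sketch of this global descriptor, assuming an RGB uint8 input, is given below. GIST has no standard scikit-image implementation, so it appears only as a placeholder hook; the HOG and resizing parameters are illustrative choices.

```python
import numpy as np
import cv2
from skimage.feature import hog

def global_features(rgb):
    """Concatenate HSV mean/STD, a GIST placeholder, and HOG features with 2x2 blocks."""
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV).reshape(-1, 3).astype(float)
    hsv_stats = np.concatenate([hsv.mean(axis=0), hsv.std(axis=0)])      # 6-D

    gray = cv2.resize(cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY), (128, 128))
    hog_feat = hog(gray, orientations=8, pixels_per_cell=(16, 16),
                   cells_per_block=(2, 2))                                # HOG 2x2-style blocks

    gist_feat = np.zeros(512)  # placeholder: plug in an external GIST implementation here

    return np.concatenate([hsv_stats, gist_feat, hog_feat])
```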

Local Features

We define local features as features originating from visual saliency. As shown in “Results of Human Psychophysics Studies” and “Unified Interpretation of Human Sentiment Perception”, human visual perception is partly or largely carried by the focal object and is significantly affected by image semantics. Based on these human visual characteristics, we design the following types of local features.

First, to model the patterns of visual attention, we compute the entropy of the saliency map (generated using SALICON [65]) of each image. Entropy is a statistical measure of randomness; it is lower when the corresponding image attracts more focused attention. Figure 7 shows the distribution of entropies over all images (0.69 ± 0.25). We expect this feature to predict attributes related to human attention, such as “have focused object” and “close view”. We then include the fully automated object detector Object Bank (OB), which models the presence of a pre-defined set of objects [99], to encode the salient objects in an image; other studies have also used object features for high-level image attribute prediction [8]. We further use an automated method for predicting the locations of salient regions in an image [100]. Once we have the focal region, we compute its relative size and its distance to the image center to encode the spatial layout of the image, and we also compute basic color statistics of the region. The above features are concatenated to form the local feature (hereafter referred to as “local basics”).
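
The saliency-entropy and focal-region layout statistics can be sketched as follows, assuming a saliency map normalized to [0, 1] from an external predictor such as SALICON; the entropy normalization and region threshold are assumptions rather than the exact settings used.

```python
import numpy as np

def saliency_entropy(sal_map, n_bins=256):
    """Shannon entropy of a saliency map (lower = more focused attention)."""
    p, _ = np.histogram(sal_map, bins=n_bins, range=(0.0, 1.0))
    p = p.astype(float) / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(n_bins))   # normalized to [0, 1]

def focal_region_layout(sal_map, threshold=0.7):
    """Relative size and center distance of the most salient region."""
    h, w = sal_map.shape
    mask = sal_map >= threshold * sal_map.max()
    ys, xs = np.nonzero(mask)
    rel_size = float(mask.mean())
    cy, cx = ys.mean() / h, xs.mean() / w
    dist_to_center = float(np.hypot(cy - 0.5, cx - 0.5))
    return rel_size, dist_to_center
```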
Fig. 7

Distribution of saliency entropy of all images (a). Images (with saliency map on the right) of highest (b) and lowest (c) saliency entropy

Spatial Frequency Features

As discussed in “Results of Human Psychophysics Studies”, both LSF and HSF carry important information, each for a distinct set of image sentiments and semantic attributes. Our initial approach was to compute LSF and HSF features directly from a discrete wavelet transform. We applied a four-level, two-dimensional Haar wavelet decomposition and took the mean and standard deviation of the coefficients of the first band as the LSF feature, and those of the second to fourth bands as the HSF feature. However, these features did not perform well in predicting most of the attributes (ρs < 0.26), suggesting that wavelet coefficients alone are insufficient for high-level attribute prediction. We therefore developed an alternative way to extract LSF and HSF features based on natural image statistics. Statistical models can be used to represent regularities inherent in natural images [101]; for example, high-contrast local image patches, which mainly correspond to edge structures, display regular patterns. This motivates us to use gradient information to model the different frequency bands. Let I(x, y) denote the image intensity; we compute the surface gradient of the image intensity with a scaling constant α as follows:
$$\left| \operatorname{grad}(\alpha I) \right| = \sqrt{\frac{\left| \nabla I \right|^{2}}{\alpha^{-2} + \left| \nabla I \right|^{2}}}, \quad \text{where } \left| \nabla I \right| = \sqrt{I_{x}^{2} + I_{y}^{2}}.$$
(1)

The constant α controls the relative emphasis on low-gradient versus high-gradient regions. We compute the gradient on the R, G, B channels at every pixel of the LSF and HSF images from the psychophysics experiment, with α = 0.25, and use spatial pooling to reduce the dimension to 98 in the final algorithm. The results from the LSF and HSF images are used as the LSF and HSF features, respectively.
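
A sketch of this feature, following Eq. (1) with α = 0.25, is shown below. The 4 × 4 pooling grid is an assumption; the text does not specify the exact pooling that yields 98 dimensions.

```python
import numpy as np

def scaled_gradient(channel, alpha=0.25):
    """|grad(alpha * I)| from Eq. (1), computed per pixel."""
    iy, ix = np.gradient(channel.astype(float))
    grad2 = ix ** 2 + iy ** 2
    return np.sqrt(grad2 / (alpha ** -2 + grad2))

def sf_feature(rgb, alpha=0.25, grid=(4, 4)):
    """Spatially pooled gradient responses over the R, G, B channels."""
    feats = []
    gh, gw = grid
    for c in range(3):
        g = scaled_gradient(rgb[..., c], alpha)
        h, w = g.shape
        for i in range(gh):
            for j in range(gw):
                cell = g[i * h // gh:(i + 1) * h // gh,
                         j * w // gw:(j + 1) * w // gw]
                feats.append(cell.mean())
    return np.asarray(feats)   # applied separately to the LSF and HSF images
```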

Results of Computational Modeling

We use the designed features to train support vector regressors (SVR) [102] that predict the 31 image attributes. As we have multiple types of features, and different features contribute differently to distinct sets of attributes, we use a weighted kernel sum [94] to fuse the multiple features by automatically learning their weights for each attribute.

The experiments are performed on the 400 images with human-annotated ground truth on the 31 attributes. An SVR is trained on each feature type for each attribute. We use a grid search to select the cost C, the RBF kernel parameter γ, and the ϵ hyperparameter. We split the data into 80% for training and 20% for testing and use fivefold cross-validation. Results are evaluated by their Spearman’s rank correlation with the ground-truth human annotations.
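
The sketch below illustrates this setup with scikit-learn: per-feature-type RBF kernels are fused by a weighted sum and fed to an SVR with a precomputed kernel. For brevity it uses a single train/test split and a naive grid over the kernel weights, rather than the fivefold cross-validation and the weight-learning scheme of [94].

```python
import numpy as np
from itertools import product
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split

def fused_kernel(blocks_a, blocks_b, weights, gamma=0.1):
    """Weighted sum of per-feature-type RBF kernels."""
    return sum(w * rbf_kernel(a, b, gamma=gamma)
               for w, a, b in zip(weights, blocks_a, blocks_b))

def train_attribute_svr(feature_blocks, y, weight_grid, C_grid=(1, 10), eps_grid=(0.01, 0.1)):
    """feature_blocks: list of (n_images, d_i) arrays; y: (n_images,) attribute scores."""
    idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=0.2, random_state=0)
    blocks_tr = [f[idx_tr] for f in feature_blocks]
    blocks_te = [f[idx_te] for f in feature_blocks]
    best = (-np.inf, None)
    for weights, C, eps in product(weight_grid, C_grid, eps_grid):
        K_tr = fused_kernel(blocks_tr, blocks_tr, weights)
        K_te = fused_kernel(blocks_te, blocks_tr, weights)
        svr = SVR(kernel='precomputed', C=C, epsilon=eps).fit(K_tr, y[idx_tr])
        rho, _ = spearmanr(svr.predict(K_te), y[idx_te])
        best = max(best, (rho, (weights, C, eps)), key=lambda t: t[0])
    return best   # (test Spearman rho, chosen hyperparameters)
```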

The results are reported in Table 2 and Fig. 8. A visualization of all prediction results is shown in the Supplementary Material. We make the following observations:
Table 2

Predictions (Spearman’s rank correlation) on various attributes based on local, global, spatial frequency features, and our own method that combines all the feature channels. Local features: saliency entropy, salient objects, focal region. Global features: global HSV, GIST. SF features: LSF, HSF.

Attribute | Saliency entropy | Salient objects | Focal region | Global HSV | GIST | LSF | HSF | Ours
Colorful | 0.62 | 0.20 | 0.20 | 0.58 | 0.20 | 0.54 | 0.54 | 0.65
Close view | 0.31 | 0.59 | 0.22 | 0.27 | 0.45 | 0.39 | 0.45 | 0.60
Single focus | 0.26 | 0.38 | 0.35 | 0.08 | 0.32 | 0.15 | 0.17 | 0.43
Focused object | 0.26 | 0.43 | 0.35 | 0.04 | 0.24 | 0.10 | 0.20 | 0.41
Centered | 0.25 | 0.33 | 0.31 | 0.05 | 0.23 | 0.20 | 0.17 | 0.38
Sharp | 0.27 | 0.24 | 0.25 | 0.28 | 0.08 | 0.16 | 0.25 | 0.37
People present | 0.18 | 0.32 | 0.20 | 0.12 | 0.28 | 0.21 | 0.10 | 0.41
Object combo | 0.30 | 0.27 | 0.14 | 0.27 | 0.29 | 0.27 | 0.30 | 0.35
High quality | 0.29 | 0.28 | 0.23 | 0.25 | 0.06 | 0.12 | 0.26 | 0.34
Clean scene | 0.13 | 0.21 | 0.15 | 0.17 | 0.10 | 0.14 | 0.17 | 0.33
Interesting | 0.18 | 0.21 | 0.22 | 0.21 | 0.12 | 0.10 | 0.21 | 0.30
Exciting | 0.20 | 0.28 | 0.23 | 0.26 | 0.12 | 0.14 | 0.25 | 0.30
Natural lighting | 0.24 | 0.15 | 0.07 | 0.24 | 0.13 | 0.24 | 0.20 | 0.30
Dynamic | 0.14 | 0.22 | 0.21 | 0.18 | 0.11 | 0.14 | 0.18 | 0.29
Neat space | 0.15 | 0.22 | 0.19 | 0.10 | 0.21 | 0.11 | 0.20 | 0.28
Empty space | 0.12 | 0.15 | 0.16 | 0.14 | 0.24 | 0.19 | 0.22 | 0.28
Make happy | 0.25 | 0.14 | 0.14 | 0.23 | 0.07 | 0.14 | 0.23 | 0.27
Attractive | 0.26 | 0.26 | 0.17 | 0.18 | 0.15 | 0.11 | 0.23 | 0.27
Expert photo | 0.12 | 0.07 | 0.28 | 0.17 | 0.08 | 0.12 | 0.25 | 0.27
Color harmony | 0.28 | 0.21 | 0.13 | 0.25 | 0.01 | 0.14 | 0.10 | 0.26
Contain fine details | 0.14 | 0.23 | 0.18 | 0.13 | 0.17 | 0.16 | 0.24 | 0.25
Mysterious | 0.16 | 0.22 | 0.12 | 0.16 | 0.13 | 0.20 | 0.08 | 0.25
Have storyline | 0.10 | 0.23 | 0.13 | 0.07 | 0.18 | 0.10 | 0.11 | 0.24
Natural color | 0.22 | 0.17 | 0.12 | 0.21 | 0.01 | 0.11 | 0.11 | 0.23
Strange | 0.13 | 0.19 | 0.05 | 0.08 | 0.13 | 0.18 | 0.15 | 0.21
Aesthetic | 0.15 | 0.19 | 0.20 | 0.10 | 0.09 | 0.09 | 0.15 | 0.20
Natural perspective | 0.15 | 0.16 | 0.11 | 0.17 | 0.03 | 0.19 | 0.13 | 0.19
Natural object | 0.15 | 0.19 | 0.15 | 0.16 | 0.11 | 0.19 | 0.16 | 0.19
Familiar scene | 0.11 | 0.17 | 0.10 | 0.08 | 0.12 | 0.11 | 0.19 | 0.19
Photorealistic | 0.11 | 0.10 | 0.12 | 0.09 | 0.04 | 0.07 | 0.01 | 0.14
Make sad | 0.10 | 0.13 | 0.10 | 0.09 | 0.12 | 0.14 | 0.11 | 0.15
Mean | 0.17 | 0.18 | 0.15 | 0.20 | 0.18 | 0.17 | 0.20 | 0.30

Bold font represents the highest performance on each attribute

Fig. 8

Top-ranked images from our computational prediction. Our model successfully learned high-level concepts such as image sentiment, semantics, and spatial layout, suggesting the effectiveness of our features, which are motivated by insights into human visual perception

  1. Our model, which integrates human visual characteristics, achieves overall higher performance than models using local, global, or spatial frequency features alone, suggesting that integrating the various channels of human perception helps boost performance.

  2. Local features significantly outperform global and spatial frequency features on attributes related to image focus (“have focused object” and “centered”), and slightly outperform them on “strange” and “make sad”. This echoes our findings in the human psychophysics study.

  3. HSF features outperform LSF features on positive sentiments (e.g., “make happy”, “exciting”) and semantic attributes (e.g., “familiar scene”, “dynamic”), whereas LSF moderately outperforms HSF on negative sentiments (e.g., “make sad”, “strange”). These results are consistent with our findings in the human studies (see “Results of Human Psychophysics Studies”).

Applications

To test the generalizability of our model, in this section, we apply our model to new data sets and attributes. We perform a series of experiments on two new data sets: (1) the International Affective Picture System (IAPS) [103], and (2) the Twitter Dataset [10].

Emotion Categorization on IAPS Dataset

The IAPS is an image set widely used in emotion research. It consists of natural color photos depicting complex scenes containing portraits, animals, landscapes, and so on. We use 389 of these images, selecting those previously coded as primarily eliciting one of eight discrete emotions: anger, disgust, fear, sadness, amusement, awe, contentment, and excitement [104].

During the experiments, we use a support vector machine (SVM) [102] to assign each image to one of the eight emotion classes. We follow [4] and perform one-vs-all classification. The SVM settings are the same as described in “Results of Computational Modeling”, except that here we use binary classification instead of regression. We use the area under the ROC curve (AUC) as the evaluation metric. The classification results are reported in Fig. 9. We compare our model with two baseline methods dedicated to emotion assessment: (1) SentiBank, a 1200-dimensional sentiment detector for predicting various sentiments [10]; and (2) Machajdik’s model for affective image classification [9].
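
A sketch of this one-vs-all protocol with scikit-learn is given below, assuming a feature matrix X and integer emotion labels y in 0–7; the kernel fusion described earlier is omitted for brevity.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def one_vs_all_auc(X, y, n_classes=8):
    """Per-emotion AUC from one-vs-all SVM classifiers with 5-fold CV."""
    aucs = {}
    for c in range(n_classes):
        target = (y == c).astype(int)
        clf = SVC(kernel='rbf', probability=True, class_weight='balanced')
        scores = cross_val_predict(clf, X, target, cv=5, method='predict_proba')[:, 1]
        aucs[c] = roc_auc_score(target, scores)
    return aucs
```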
Fig. 9

Classification performance for IAPS data set compared against other methods

As shown in Fig. 9, our model performs best overall on the IAPS data set, compared with both [9] and [10]. This reflects the performance boost gained from building computational feature detectors based on human perception.

Visual Sentiment Classification

We also test our model’s ability to classify sentiment on the Twitter Dataset [10]. The dataset includes 603 tweets with photos and was originally collected to evaluate an automatic sentiment prediction method. Ground truth was obtained by the original authors through human annotation on MTurk (each image is judged as either positive or negative, yielding a binary sentiment label) [10]. Our attribute list does not include binary sentiment categories, but we expected our model to categorize images efficiently by leveraging attributes related to binary sentiment, such as “make happy” and “interesting”. We use an SVM for binary sentiment classification, with the same settings as described in “Results of Computational Modeling”.

Again we compare our model with SentiBank [10] and Machajdik’s model [9]. As shown in Fig. 10, our model outperforms the competing methods. Compared with the results on the IAPS subset, the advantage of our model is smaller here. This might be because the two baseline methods are capable of extracting features representative of positive and negative sentiments, whereas our model, which includes multi-dimensional features, is more advantageous for predicting subtler human sentiments such as amusement, awe, and contentment.

Recently, deep neural networks (DNNs) have been applied to many automated visual understanding tasks, so we also test the performance of deep learning on sentiment assessment. We first use DenseNet-121 [105] with parameters pretrained on ImageNet [106], replace its top layer with a fully connected layer of 1024 neurons with ReLU activation, and attach a prediction layer with sigmoid activation. We then fine-tune the network on the target dataset for binary sentiment classification. Training and testing are run on a single NVIDIA GeForce GTX 960M GPU using Keras with a TensorFlow backend [107, 108]. We use 20% of the images as a validation set to monitor overfitting. We also try a shallower DNN, AlexNet [109], with parameters pretrained on ImageNet.
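
The fine-tuning setup can be sketched in tf.keras as follows; the input pipeline, image size, optimizer, and learning rate are placeholders rather than the exact training configuration.

```python
import tensorflow as tf

def build_sentiment_model(input_shape=(224, 224, 3)):
    """DenseNet-121 backbone (ImageNet weights) + 1024-unit ReLU layer + sigmoid output."""
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights='imagenet',
        input_shape=input_shape, pooling='avg')
    x = tf.keras.layers.Dense(1024, activation='relu')(base.output)
    out = tf.keras.layers.Dense(1, activation='sigmoid')(x)   # binary sentiment
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC()])
    return model

# model = build_sentiment_model()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # 20% of images held out for validation
```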

As shown in Fig. 10, our model outperforms the competing methods. Although DNNs have proved powerful in object recognition and scene classification, the DNN-based models do not produce impressive performance in our task. This suggests that computer vision methods designed for visual recognition may not work well for describing images at the sentiment level. We also observe that, because of the small dataset, it is almost impossible to train a DNN without overfitting. Altogether, the results suggest that with limited training data, our multi-dimensional features, which are motivated by a deeper understanding of human perception, are more effective for sentiment prediction.
Fig. 10

Sample images with a positive and b negative sentiment predictions. c Our method outperforms other comparing methods

Discussion

In this work, we test visual features ranging from low level to high level. Overall, the model that integrates multi-dimensional features performs best, suggesting that taking into account the various channels of human perception helps boost performance. More specifically, different levels of image features contribute differently to different image attributes: local features are most effective for attention-related attributes, HSF features contribute most to positive sentiments, and LSF features are more effective for negative sentiments. Based on these findings, we apply our model to two applications related to human emotion prediction; the results demonstrate the efficacy of our model when training data are limited.

Conclusions and Future Directions

In this research, we perform an extensive psychophysics experiment to empirically derive insights into the processes of high-level image perception. Extensive analyses of the human data suggest varying importance of holistic and local cues, color information, semantics, and saliency for human perception of different types of attributes. Based on these findings, we build an empirical model of human visual sentiment perception. We further design a computational model for automated assessment of high-level perceptions of images. A series of comparative and application experiments demonstrate that integrating local, global, and spatial frequency information substantially boosts prediction performance.

While we focus on human behavioral characteristics and hand-crafted features motivated by human cognition, we are aware that deep neural networks play an important role in high-level visual understanding tasks. Nevertheless, we advocate that understanding human behavioral characteristics is vital for creating artificial intelligence. Our experiments demonstrate that a deeper understanding of human visual perception can produce better computational models for high-level image understanding, even without deep learning.

There are many potentially promising directions yet to be explored. For example, it might be fruitful to investigate why human perception of multiple attributes is most consistent when images are inverted. We hypothesized that this is due to disruption of the cognitive recognition process, which leads humans to rely more on low-level features; however, there are other possible explanations, such as Gestalt principles in visual attention [39, 110]. Although the DNN-based models did not show an advantage in our tasks, it remains important to incorporate what we have learned about human visual characteristics into DNN models, with the aim of developing more human-like deep learning [111, 112].

Notes

Acknowledgements

This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its Strategic Capability Research Centres Funding Initiative. The authors want to thank Dr. Cheston Tan for his contribution to empirical modeling, and Dr. Ming Jiang, Dr. Seng-Beng Ho, and Dr. Tian-Tsong Ng for helpful discussions.

Compliance with Ethical Standards

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Supplementary material

Supplementary material 1 (PDF 10114 kb)

References

  1. 1.
    Sun R, Lian Z, Tang Y, Xiao J. Aesthetic visual quality evaluation of chinese handwritings. In: Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.Google Scholar
  2. 2.
    Majumdar A, Krishnan P, Jawahar C. Visual aesthetic analysis for handwritten document images. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE. 2016; p. 423–428.Google Scholar
  3. 3.
    Adak, C., Chaudhuri, B.B., Blumenstein, M.: Legibility and aesthetic analysis of handwriting. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Volume 1., IEEE (2017) 175–182.Google Scholar
  4. 4.
    Liu Y, Gu Z, Ko TH. Predicting media interestingness via biased discriminant embedding and supervised manifold regression. In: MediaEval. 2017.Google Scholar
  5. 5.
    Marquant G, Demarty CH, Chamaret C, Sirot J, Chevallier L. Interestingness prediction & its application to immersive content. In: 2018 International Conference on content-based multimedia indexing (CBMI), IEEE. 2018; p. 1–6.Google Scholar
  6. 6.
    De Heering A, Houthuys S, Rossion B. Holistic face processing is mature at 4 years of age: evidence from the composite face effect. J Exp Child Psychol. 2007;96(1):57–70.CrossRefGoogle Scholar
  7. 7.
    Ke Y, Tang X, Jing F. The design of high-level features for photo quality assessment. In: CVPR. Volume 1., IEEE. 2006; p. 419–426.Google Scholar
  8. 8.
    Roy H, Yamasaki T, Hashimoto T. Predicting image aesthetics using objects in the scene. In: Proceedings of the 2018 International Joint Workshop on multimedia artworks analysis and attractiveness computing in multimedia, ACM. 2018; p. 14–19.Google Scholar
  9. 9.
    Machajdik J, Hanbury A. Affective image classification using features inspired by psychology and art theory. In: ACM Multimedia, ACM. 2010; p. 83–92.Google Scholar
  10. 10.
    Borth D, Ji R, Chen T, Breuel T, Chang SF. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: ACM Multimedia. 2013; p. 223–232.Google Scholar
  11. 11.
    Song K, Yao T, Ling Q, Mei T. Boosting image sentiment analysis with visual attention. Neurocomputing. 2018;312:218–28.CrossRefGoogle Scholar
  12. 12.
    Khosla A, Raju AS, Torralba A, Oliva A. Understanding and predicting image memorability at a large scale. In: ICCV. 2015.Google Scholar
  13. 13.
    Jing P, Su Y, Nie L, Gu H, Liu J, Wang M. A framework of joint low-rank and sparse regression for image memorability prediction. In: IEEE Transactions on Circuits and Systems for Video Technology. 2018.Google Scholar
  14. 14.
    Zhu JY, Kr¨ahenb¨uhl P, Shechtman E, Efros. Learning a discriminative model for the perception of realism in composite images. arXiv preprint arXiv:1510.00477 2015.
  15. 15.
    Jou B, Chen T, Pappas N, Redi M, Topkara M, Chang SF. Visual affect around the world: A large-scale multilingual visual sentiment ontology. In: Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, ACM. 2015; p. 159–168.Google Scholar
  16. 16.
    Sergent J. An investigation into component and configural processes underlying face perception. Br J Psychol. 1984;75(2):221–42.CrossRefGoogle Scholar
  17. 17.
    Schwaninger A, Lobmaier JS, Wallraven C, Collishaw S. Two routes to face perception: evidence from psychophysics and computational modeling. Cognit Sci. 2009;33(8):1413–40.CrossRefGoogle Scholar
  18. 18.
    Tan C. Towards a unified account of face (and maybe object) processing. PhD thesis, Massachusetts Institute of Technology (2012).Google Scholar
  19. 19.
    Maffei L, Fiorentini A. The visual cortex as a spatial frequency analyser. Vis Res. 1973;13(7):1255–67.CrossRefGoogle Scholar
  20. 20.
    De Valois RL, Albrecht DG, Thorell LG. Spatial frequency selectivity of cells in macaque visual cortex. Vis Res. 1982;22(5):545–59.CrossRefGoogle Scholar
  21. 21.
    Beck J, Sutter A, Ivry R. Spatial frequency channels and perceptual grouping in texture segregation. Comput Vis Graphics Image Process. 1987;37(2):299–325.CrossRefGoogle Scholar
  22. 22.
    Campbell F, Maffei L. The influence of spatial frequency and contrast on the perception of moving patterns. Vis Res. 1981;21(5):713–21.CrossRefGoogle Scholar
  23. 23.
    Moore RS, Stammerjohan CA, Coulter RA. Banner advertiser-web site context congruity and color effects on attention and attitudes. J Advers. 2005;34(2):71–84.CrossRefGoogle Scholar
  24. 24.
    Li X. The application and effect analysis of colour in new media advertisement. In: 7th International Conference on management, education, information and control (MEICI 2017), Atlantis Press. 2017.Google Scholar
  25. 25.
    Isola P, Xiao J, Parikh D, Torralba A, Oliva. What makes a photograph memorable? Pattern analysis and machine intelligence. IEEE Trans. 2014;36(7):1469–82.Google Scholar
  26. 26.
    Datta, R., Li, J., Wang, J.Z.: Algorithmic inferencing of aesthetics and emotion in natural images: An exposition. In: ICIP, IEEE (2008) 105–108.Google Scholar
  27. 27.
    Wu Y, Bauckhage C, Thurau C: The good, the bad, and the ugly: Predicting aesthetic image labels. In: Pattern Recognition (ICPR), 2010 20th International Conference on, IEEE. 2010; p. 1586–1589.Google Scholar
  28. 28.
    Gygli M, Grabner H, Riemenschneider H, Nater F, Gool LV. The interestingness of images. In: ICCV, IEEE. 2013; p. 1633–1640.Google Scholar
  29. 29.
    Khosla A, Das Sarma A, Hamid R. What makes an image popular? In: Proceedings of the 23rd international conference on World wide web, International World Wide Web Conferences Steering Committee. 2014; p. 867–876.Google Scholar
  30. 30.
    Lalonde J, Efros A. Using color compatibility for assessing image realism. In: ICCV. 2007.Google Scholar
  31. 31.
    Fan S, Ng TT, Herberg JS, Koenig BL, Tan CYC, Wang R. An automated estimator of image visual realism based on human cognition. In: CVPR, IEEE. 2014; p. 4201–4208.Google Scholar
  32. 32.
    Lu X, Suryanarayan P, Adams Jr, RB, Li J, Newman MG, Wang JZ. On shape and the computability of emotions. In: Proceedings of the 20th ACM international conference on Multimedia, ACM. 2012; p. 229–238.Google Scholar
  33. 33.
    Yang J, She D, Lai YK, Rosin PL, Yang MH. Weakly supervised coupled networks for visual sentiment analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018; p. 7584–7592.Google Scholar
  34. 34.
    Dubey, R., Peterson, J., Khosla, A., Yang, M.H., Ghanem.: What makes an object memorable? In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1089–1097.Google Scholar
  35. 35.
    Chen T, Borth D, Darrell T, Chang SF. Deepsentibank: visual sentiment concept classification with deep convolutional neural networks. arXiv preprint arXiv:1410.8586 2014.
  36. 36.
    Attneave F. Some informational aspects of visual perception. Psychol Rev. 1954;61(3):183.CrossRefGoogle Scholar
  37. 37.
    Gordon IE. Theories of visual perception. Hove: Psychology Press; 2004.CrossRefGoogle Scholar
  38. 38.
    Rhodes G. The evolutionary psychology of facial beauty. Annu Rev Psychol. 2006;57:199–226.CrossRefGoogle Scholar
  39. 39.
    Wagemans J, Elder JH, Kubovy M, Palmer SE, Peterson MA, Singh M, von der Heydt R. A century of gestalt psychology in visual perception: I. perceptual grouping and figure–ground organization. Psychol Bull. 2012;138(6):1172.CrossRefGoogle Scholar
  40. 40.
    Bruce V, Young AW. Face perception. Hove: Psychology Press; 2012.Google Scholar
  41. 41.
    Tanaka J, Gauthier I. Expertise in object and face recognition. Psychol Learn Motiv. 1997;36:83–125.CrossRefGoogle Scholar
  42. 42.
    Oliva A, Torralba A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis. 2001;42(3):145–75.zbMATHCrossRefGoogle Scholar
  43. 43.
    Peterson MA, Rhodes G. Perception of faces, objects, and scenes: Analytic and holistic processes. Oxford: Oxford University Press; 2003.Google Scholar
  44. 44.
    Rhodes G, Byatt G, Michie PT, Puce A. Is the fusiform face area specialized for faces, individuation, or expert individuation? J Cognit Neurosci. 2004;16(2):189–203.CrossRefGoogle Scholar
  45. 45.
    Daugman JG. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. JOSA A. 1985;2(7):1160–9.CrossRefGoogle Scholar
  46. 46.
    DeValois RL, DeValois KK. Spatial vision. Oxford: Oxford University Press; 1990.Google Scholar
  47. 47.
    Harris CS. Visual coding and adaptability. Hove: Psychology Press; 2014.CrossRefGoogle Scholar
  48. 48.
    Tamura H, Mori S, Yamawaki T. Textural features corresponding to visual perception. IEEE Transa Syst Man Cybern. 1978;8(6):460–73.CrossRefGoogle Scholar
  49. Watt R. Scanning from coarse to fine spatial scales in the human visual system after the onset of a stimulus. JOSA A. 1987;4(10):2006–21.
  50. Bar M, Kassam KS, Ghuman AS, Boshyan J, Schmid AM, Dale AM, Hämäläinen MS, Marinkovic K, Schacter DL, Rosen BR, et al. Top-down facilitation of visual recognition. Proc Natl Acad Sci USA. 2006;103(2):449–54.
  51. Hussein A, Boix X, Poggio T. Training neural networks for object recognition using blurred images. In: APS March meeting abstracts, volume 2019. 2019; p. G70.012. https://ui.adsabs.harvard.edu/abs/2019APS..MARG70012H.
  52. Judd T, Durand F, Torralba A. Fixations on low-resolution images. J Vis. 2011;11(4):14–14.
  53. Röhrbein F, Goddard P, Schneider M, James G, Guo K. How does image noise affect actual and predicted human gaze allocation in assessing image quality? Vis Res. 2015;112:11–25.
  54. Posner MI, Petersen SE. The attention system of the human brain. Technical report, DTIC Document. 1989.
  55. Chun MM. Contextual cueing of visual attention. Trends Cognit Sci. 2000;4(5):170–8.
  56. Lang PJ, Bradley MM. The international affective picture system (IAPS) in the study of emotion and attention. In: Handbook of emotion elicitation and assessment, volume 29. New York, NY: Oxford University Press; 2007.
  57. Fan S, Shen Z, Jiang M, Koenig BL, Xu J, Kankanhalli MS, Zhao Q. Emotional attention: a study of image sentiment and visual attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018; p. 7521–7531.
  58. Cordel M, Fan S, Shen Z, Kankanhalli MS. Emotion-aware human attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2019.
  59. Wong LK, Low KL. Saliency-enhanced image aesthetics class prediction. In: 2009 16th IEEE International Conference on Image Processing (ICIP), IEEE. 2009; p. 997–1000.
  60. Khosla A, Xiao J, Torralba A, Oliva A. Memorability of image regions. In: Advances in neural information processing systems. 2012; p. 296–304.
  61. Mancas M, Le Meur O. Memorability of natural scenes: the role of attention. In: 2013 IEEE International Conference on Image Processing, IEEE. 2013; p. 196–200.
  62. Fajtl J, Argyriou V, Monekosso D, Remagnino P. AMNet: memorability estimation with attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018; p. 6363–6372.
  63. Paolacci G, Chandler J, Ipeirotis P. Running experiments on Amazon Mechanical Turk. Judgm Decis Making. 2010;5(5):411–9.
  64. Fan S, Jiang M, Shen Z, Koenig BL, Kankanhalli MS, Zhao Q. The role of visual attention in sentiment prediction. In: Proceedings of the 25th ACM international conference on Multimedia, ACM. 2017; p. 217–225.
  65. Huang X, Shen C, Boix X, Zhao Q. SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks. In: The IEEE International Conference on Computer Vision (ICCV). 2015.
  66. Russell BC, Torralba A, Murphy KP, Freeman WT. LabelMe: a database and web-based tool for image annotation. Int J Comput Vis. 2008;77(1–3):157–73.
  67. Young A, Hellawell D, Hay D. Configurational information in face perception. Perception. 1987;16(6):747–59.
  68. Goffaux V, Rossion B. Faces are "spatial"–holistic face perception is supported by low spatial frequencies. J Exp Psychol Hum Percept Perform. 2006;32(4):1023.
  69. Oliva A, Torralba A, Schyns PG. Hybrid images. In: ACM Transactions on Graphics (TOG), volume 25, ACM. 2006; p. 527–532.
  70. Schyns PG, Oliva A. Dr. Angry and Mr. Smile: when categorization flexibly modifies the perception of faces in rapid visual presentations. Cognition. 1999;69(3):243–65.
  71. Allahbakhsh M, Benatallah B, Ignjatovic A, Motahari-Nezhad HR, Bertino E, Dustdar S. Quality control in crowdsourcing systems: issues and directions. IEEE Internet Comput. 2013;17(2):76–81.
  72. Ma X, Hancock JT, Mingjie KL, Naaman M. Self-disclosure and perceived trustworthiness of Airbnb host profiles. In: CSCW. 2017; p. 2397–2409.
  73. Rodríguez-Pardo C, Bilen H. Personalised aesthetics with residual adapters. In: Morales A, Fierrez J, Sánchez JS, Ribeiro B, editors. Pattern recognition and image analysis. Cham: Springer; 2019. p. 508–520. ISBN 978-3-030-31332-6.
  74. Pardo A, Jovanovic J, Dawson S, Gašević D, Mirriahi N. Using learning analytics to scale the provision of personalised feedback. Br J Educ Technol. 2019;50(1):128–38.
  75. Kreft I, de Leeuw J. Introducing multilevel modeling. Newcastle upon Tyne: Sage; 1998.
  76. Weiss NA, Weiss CA. Introductory statistics. London: Pearson Education USA; 2012.
  77. Valdez P, Mehrabian A. Effects of color on emotions. J Exp Psychol Gen. 1994;123(4):394.
  78. Sokolova MV, Fernández-Caballero A, Ros L, Latorre JM, Serrano JP. Evaluation of color preference for emotion regulation. In: Artificial computation in biology and medicine. Springer. 2015; p. 479–487.
  79. Datta R, Joshi D, Li J, Wang JZ. Studying aesthetics in photographic images using a computational approach. In: European conference on computer vision. Berlin, Heidelberg: Springer; 2006. p. 288–301.
  80. Moshagen M, Thielsch MT. Facets of visual aesthetics. Int J Hum Comput Stud. 2010;68(10):689–709.
  81. Valentine T. A unified account of the effects of distinctiveness, inversion, and race in face recognition. Q J Exp Psychol Sect A. 1991;43(2):161–204.
  82. Farah MJ, Tanaka JW, Drain HM. What causes the face inversion effect? J Exp Psychol Hum Percept Perform. 1995;21(3):628.
  83. Delplanque S, Ndiaye K, Scherer K, Grandjean D. Spatial frequencies or emotional effects? A systematic measure of spatial frequencies for IAPS pictures by a discrete wavelet analysis. J Neurosci Methods. 2007;165(1):144–50.
  84. Lang PJ, Bradley MM, Cuthbert BN. Emotion, attention, and the startle reflex. Psychol Rev. 1990;97(3):377.
  85. Wells A, Matthews G. Attention and emotion. London: LEA; 1994.
  86. Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell. 1998;20(11):1254–9.
  87. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999;2(11):1019–25.
  88. Rossion B, Gauthier I. How does the brain process upright and inverted faces? Behav Cognit Neurosci Rev. 2002;1(1):63–75.
  89. Gomes CF, Brainerd CJ, Stein LM. Effects of emotional valence and arousal on recollective and nonrecollective recall. J Exp Psychol Learn Mem Cognit. 2013;39(3):663.
  90. Poggio T, Girosi F. Networks for approximation and learning. Proc IEEE. 1990;78(9):1481–97.
  91. Gauthier I, Tarr M, et al. Becoming a "greeble" expert: exploring mechanisms for face recognition. Vis Res. 1997;37(12):1673–82.
  92. Wong YK, Folstein JR, Gauthier I. The nature of experience determines object representations in the visual system. J Exp Psychol Gen. 2012;141(4):682.
  93. Cox D, Pinto N. Beyond simple features: a large-scale feature search approach to unconstrained face recognition. In: 2011 IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), IEEE. 2011; p. 8–15.
  94. Isola P, Xiao J, Torralba A, Oliva A. What makes an image memorable? In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. 2011; p. 145–152.
  95. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A. SUN database: large-scale scene recognition from abbey to zoo. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. 2010; p. 3485–3492.
  96. Oliva A, Torralba A. Building the gist of a scene: the role of global image features in recognition. Prog Brain Res. 2006;155:23–36.
  97. Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: CVPR, volume 1, IEEE. 2005; p. 886–893.
  98. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell. 2010;32(9):1627–45.
  99. Li LJ, Su H, Fei-Fei L, Xing EP. Object bank: a high-level image representation for scene classification & semantic feature sparsification. In: Advances in neural information processing systems. 2010; p. 1378–1386.
  100. Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, Shum HY. Learning to detect a salient object. IEEE Trans Pattern Anal Mach Intell. 2011;33(2):353–67.
  101. Srivastava A, Lee AB, Simoncelli EP, Zhu SC. On advances in statistical modeling of natural images. J Math Imaging Vis. 2003;18(1):17–33.
  102. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST). 2011;2(3):27.
  103. Lang PJ, Bradley MM, Cuthbert BN. International affective picture system (IAPS): affective ratings of pictures and instruction manual. Technical report A-8. 2008.
  104. Mikels JA, Fredrickson BL, Larkin GR, Lindberg CM, Maglio SJ, Reuter-Lorenz PA. Emotional category data on images from the international affective picture system. Behav Res Methods. 2005;37(4):626–30.
  105. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017; p. 4700–4708.
  106. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115(3):211–52.
  107. Chollet F. Keras. GitHub repository. 2015.
  108. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. 2016.
  109. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems, vol. 25. Red Hook: Curran Associates Inc.; 2012. p. 1097–105.
  110. Rock I, Palmer S. The legacy of Gestalt psychology. Sci Am. 1990;263(6):84–91.
  111. Sabour S, Frosst N, Hinton GE. Matrix capsules with EM routing. In: 6th International Conference on Learning Representations, ICLR. 2018.
  112. Arend L, Han Y, Schrimpf M, Bashivan P, Kar K, Poggio T, DiCarlo JJ, Boix X. Single units in a deep neural network functionally correspond with neurons in the brain: preliminary results. Technical report, Center for Brains, Minds and Machines (CBMM). 2018.

Copyright information

© Springer Nature Singapore Pte Ltd 2020

Authors and Affiliations

  1. School of Computing, National University of Singapore, Singapore, Singapore
  2. Psychology Department, Southern Utah University, Cedar City, USA
  3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
