
1 Introduction

The ultimate aim of computer vision has always been to enable computers to understand images the way humans do. With the latest advances in deep learning, the availability of large volumes of training data and the use of powerful graphics processing units, computer vision systems are now able to locate and classify objects in natural images with high accuracy, surpassing human performance in some specific tasks. However, we are still a long way from human-like analysis and extraction of high-level semantics from images. This work aims to push high-level image recognition forward by enabling machines to interpret art.

To study automatic interpretation of art, we introduce SemArt, a dataset for semantic art understanding. We build SemArt by gathering a collection of fine-art images, each with its respective attributes (author, type, school, etc.) as well as a short artistic comment or description, such as those that commonly appear in art catalogues or museum collections. Artistic comments involve not only descriptions of the visual elements that appear in the scene but also references to its technique, author or context. Some examples of the dataset are shown in Fig. 1.

Fig. 1. SemArt dataset samples. Each sample is a triplet of image, attributes and artistic comment.

We address semantic art understanding by proposing a number of models that map paintings and artistic comments into a common semantic space, thus enabling comparison in terms of semantic similarity. To evaluate and benchmark the proposed models, we design the Text2Art challenge as a multi-modal retrieval task. The aim of the challenge is to evaluate whether the models capture enough of the insights and clues provided by the artistic description to be able to match it to the correct painting.

A key difference from previous work on semantic understanding of natural images (e.g. the MS-COCO dataset [15]) is that our system relies on background knowledge about art history and artistic styles. As already noted in previous work [3,4,5], paintings are substantially different from natural images in several aspects. Firstly, paintings, unlike natural images, are figurative representations of people, objects, places or situations which may or may not correspond to the real world. Secondly, the study of fine-art paintings usually requires previous knowledge about the history of art and different artistic styles, as well as contextual information about the subjects represented. Thirdly, paintings commonly exhibit one or more layers of abstraction and symbolism which create ambiguity in interpretation.

In this work, we harness existing prior knowledge about art and deep neural networks to model understanding of fine-art paintings. Specifically, our contributions are:

  1. to introduce the first dataset for semantic art understanding in which each sample is a triplet of images, attributes and artistic comments,

  2. to propose models to map fine-art paintings and their high-level artistic descriptions onto a joint semantic space,

  3. to design an evaluation protocol based on multi-modal retrieval for semantic art understanding, so that future research can be benchmarked under a common, public framework.

Table 1. Datasets for art analysis. Meta and Text columns state if image metadata and textual information are provided, respectively.

2 Related Work

With the digitization of large collections of fine-art paintings and the emergence of publicly available online art catalogs such as WikiArt or the Web Gallery of Art, computer vision researchers became interested in analyzing fine-art paintings automatically. Early work [2, 10, 12, 23] proposes methods based on handcrafted visual features to identify the author and/or a specific style of a piece of art. The datasets used in these approaches, such as PRINTART [2] and Painting-91 [12], are rather small, with 988 and 4,266 painting images, respectively. Mensink and Van Gemert introduce in [19] the large-scale Rijksmuseum dataset for multi-class prediction, consisting of 112,039 images of artistic objects, although only 3,593 are of fine-art paintings. With the success of convolutional neural networks (CNNs) in large-scale image classification [14], deep features from CNNs replace handcrafted image representations in many computer vision applications, including painting image classification [1, 11, 16, 18, 21, 25], and larger datasets are made publicly available [11, 18]. In these methods, paintings are fed into a CNN to predict their artistic style or author from their visual aesthetics.

Besides painting classification, other work focuses on image retrieval in artistic paintings. For example, in [2], monochromatic painting images are retrieved by using artistic-related keywords, whereas in [22] a pre-trained CNN is fine-tuned to find paintings with similar artistic motifs. Crowley and Zisserman [4] explore domain transfer to retrieve images of portraits from real faces, in the same way as [3] and [6] explore domain transfer to perform object recognition in paintings.

A summary of the existing datasets for fine-art understanding is shown in Table 1. In essence, previous work studies art from an aesthetics point of view to classify paintings according to author and style [2, 11, 12, 18, 19], to find relevant paintings according to a query input [2, 4, 22] or to identify objects in artistic representations [3]. However, understanding art also involves identifying the symbolism of the elements, the artistic influences and the historical context of the work. To study such complex processes, we propose to interpret fine-art paintings in a semantic way by introducing SemArt, a multi-modal dataset for semantic art understanding. To the best of our knowledge, SemArt is the first corpus that provides not only fine-art images and their attributes, but also artistic comments for the semantic understanding of fine-art paintings.

3 SemArt Dataset

3.1 Data Collection

To create the SemArt dataset, we collect artistic data from the Web Gallery of Art (WGA), a website with more than 44,809 images of European fine-art reproductions from between the 8th and the 19th century. WGA provides links to all their images in a downloadable comma-separated values (CSV) file. In the CSV file, each image is associated with a set of attributes or metadata: author, author’s birth and death, title, date, technique, current location, form, type, school and timeline. Following the links provided in the CSV file, we only collect images from artworks whose form field is set to painting, as opposed to other forms of art such as sculpture or architecture.

We create a script to collect the artistic comments for each painting image, as they are not provided in the aforementioned CSV file. We omit images that are not associated with any comment and remove irrelevant metadata fields, such as the author’s birth and death and the current location. The final cleaned collection contains 21,384 triplets, where each triplet is formed by an image, a text and a number of attributes.
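As a rough illustration of this filtering step, the sketch below reduces a catalogue file to painting entries that have a comment; the file names, separator and column names are hypothetical and do not reflect the actual WGA schema.

```python
import pandas as pd

# Hypothetical sketch: keep only paintings that have an artistic comment and
# drop metadata fields not used in SemArt. Column names are illustrative.
catalog = pd.read_csv("wga_catalog.csv")
paintings = catalog[catalog["FORM"].str.lower() == "painting"]
paintings = paintings[paintings["COMMENT"].notna()]
paintings = paintings.drop(columns=["BORN-DIED", "LOCATION"], errors="ignore")
paintings.to_csv("semart_triplets.csv", index=False)
```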

3.2 Data Analysis

For each sample, the metadata is provided as a set of seven fields, which describe the basic attributes of its associated painting: Author, Title, Date, Technique, Type, School and Timeframe. In total, there are 3,281 different authors, the most frequent one being Vincent van Gogh with 327 paintings. There are 14,902 different titles in the dataset, with 38.8% of the paintings presenting a non-unique title. Among all the titles, Still-Life and Self-Portrait are the most common ones. The Technique and Date fields are not available for all samples, but are provided for completeness. The Type field classifies paintings into ten different genres, such as religious, landscape or portrait. There are 26 artistic schools in the collection, Italian being the most common, with 8,860 paintings, and Finnish the least frequent, with just 5 samples. There are also 22 different timeframes, which are periods of 50 years evenly distributed between 801 and 1900. The distribution of values over the fields Type, School and Timeframe is shown in Fig. 2. With respect to the artistic comments, the vocabulary follows Zipf’s law [17]. Most of the comments are relatively short, with almost 70% of them containing 100 words or less. Images are provided in different aspect ratios and sizes. The dataset is randomly split into training, validation and test sets with 19,244, 1,069 and 1,069 triplets, respectively.

Fig. 2. Metadata distribution. Distribution of samples within the SemArt dataset in Timeframe, School and Type attributes.

4 Text2Art Challenge

In what follows, we use bold style to refer to vectors and matrices (e.g. \(\varvec{x}\) and \(\varvec{W}\)). Given a collection of artistic samples K, the k-th sample in K is given by the triplet \((img_k, com_k, att_k)\), where \(img_k\) is the artistic image, \(com_k\) the artistic comment and \(att_k\) the artistic attributes. Images, comments and attributes are input into specific encoding functions, \(f_{img}\), \(f_{com}\), \(f_{att}\), which map the raw data from the corpus into vector representations, \(\varvec{i}_k\), \(\varvec{c}_k\), \(\varvec{a}_k\), as:

$$\begin{aligned} \varvec{i}_k = f_{img}(img_k;\phi _{img}) \end{aligned}$$
(1)
$$\begin{aligned} \varvec{c}_k = f_{com}(com_k;\phi _{com}) \end{aligned}$$
(2)
$$\begin{aligned} \varvec{a}_k = f_{att}(att_k;\phi _{att}) \end{aligned}$$
(3)

where \(\phi _{img}\), \(\phi _{com}\) and \(\phi _{att}\) are the parameters of each encoding function.

Since the comment encoding, \(\varvec{c}_k\), and the attribute encoding, \(\varvec{a}_k\), are both obtained from textual data, a joint textual vector, \(\varvec{t}_k\), can be obtained as:

$$\begin{aligned} \varvec{t}_k = \varvec{c}_k \oplus \varvec{a}_k \end{aligned}$$
(4)

where \(\oplus \) is vector concatenation.

The transformation functions, \(g_{vis}\) and \(g_{text}\), can be defined as the functions that project the visual and the textual encodings into a common multi-modal space. The projected vectors \(\varvec{p}^{vis}_k\) and \(\varvec{p}^{text}_k\) are then obtained as:

$$\begin{aligned} \varvec{p}^{vis}_k = g_{vis}(\varvec{i}_k;\theta _{vis}) \end{aligned}$$
(5)
$$\begin{aligned} \varvec{p}^{text}_k = g_{text}(\varvec{t}_k;\theta _{text}) \end{aligned}$$
(6)

where \(\theta _{vis}\) and \(\theta _{text}\) are the parameters of each transformation function.

For a given similarity function d, the similarity between any text (i.e. a pair of comment and attributes) and any image in K is measured as the distance between their projections:

$$\begin{aligned} d(\varvec{p}^{text}_k, \varvec{p}^{vis}_j) = d(g_{text}(\varvec{t}_k;\theta _{text}), g_{vis}(\varvec{i}_j;\theta _{vis})) \end{aligned}$$
(7)
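As a concrete illustration of Eqs. (5)–(7), the following sketch (in PyTorch, an assumed framework) instantiates \(g_{vis}\) and \(g_{text}\) as the fully connected + tanh + \(\ell _2\)-normalization heads described later for the CML model, and takes d as one minus cosine similarity, the similarity used in the experiments (Sect. 6); all dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def project(x, W, b):
    # g(x) = l2_norm(tanh(Wx + b)); one such head per modality (Eqs. 5-6)
    return F.normalize(torch.tanh(F.linear(x, W, b)), p=2, dim=-1)

i_k = torch.randn(2048)                                  # visual encoding (illustrative size)
t_k = torch.cat([torch.randn(3000), torch.randn(300)])   # t_k = c_k concatenated with a_k (Eq. 4)

W_vis, b_vis = torch.randn(128, 2048), torch.zeros(128)  # theta_vis
W_txt, b_txt = torch.randn(128, 3300), torch.zeros(128)  # theta_text

p_vis = project(i_k, W_vis, b_vis)                       # Eq. (5)
p_txt = project(t_k, W_txt, b_txt)                       # Eq. (6)

d = 1.0 - F.cosine_similarity(p_txt, p_vis, dim=0)       # Eq. (7) with cosine similarity
```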

In semantic art understanding, the aim is to learn \(f_{img}\), \(f_{com}\), \(f_{att}\), \(g_{vis}\) and \(g_{text}\) such that images, comments and attributes from the same sample are mapped closer in terms of d than images, texts and attributes from different samples:

$$\begin{aligned} d(\varvec{p}^{text}_k, \varvec{p}^{vis}_k) < d(\varvec{p}^{text}_k, \varvec{p}^{vis}_j) \text { for all } k, j \le |K| \text { with } j \ne k \end{aligned}$$
(8)

and

$$\begin{aligned} d(\varvec{p}^{text}_k, \varvec{p}^{vis}_k) < d(\varvec{p}^{text}_j, \varvec{p}^{vis}_k) \text { for all } k, j \le |K| \text { with } j \ne k \end{aligned}$$
(9)

To evaluate semantic art understanding, we propose the Text2Art challenge as a multi-modal retrieval problem. Within Text2Art, we define two tasks: text-to-image retrieval and image-to-text retrieval. In text-to-image retrieval, the aim is to find the most relevant painting in the collection, \(img^* \in K\), given a query comment and its attributes:

$$\begin{aligned} img^* = \mathop {\mathrm {arg min}}\limits _{img_j \in K} d(\varvec{p}^{text}_k, \varvec{p}^{vis}_j) \end{aligned}$$
(10)

Similarly, in the image-to-text retrieval task, given a painting image, the aim is to find the comment and the attributes, \(com^* \in K\) and \(att^* \in K\), that are most relevant to the visual query:

$$\begin{aligned} com^*, att^* = \mathop {\mathrm {arg min}}\limits _{com_j, att_j \in K} d(\varvec{p}^{text}_j, \varvec{p}^{vis}_k) \end{aligned}$$
(11)
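Both tasks amount to ranking the whole test set by the similarity of Eq. (7). A minimal sketch, assuming the projected vectors are already computed and cosine similarity is used as in the experiments:

```python
import torch
import torch.nn.functional as F

def text_to_image(p_text_k, P_vis):
    # Eq. (10): rank all projected images (N x D) against one projected text (D,)
    sims = F.cosine_similarity(p_text_k.unsqueeze(0), P_vis, dim=1)
    return torch.argsort(sims, descending=True)   # most relevant painting first

def image_to_text(p_vis_k, P_text):
    # Eq. (11): rank all projected texts against one projected image
    sims = F.cosine_similarity(p_vis_k.unsqueeze(0), P_text, dim=1)
    return torch.argsort(sims, descending=True)
```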

5 Models for Semantic Art Understanding

We propose several models to learn meaningful textual and visual encodings and transformations for semantic art understanding. First, images, comments and attributes are encoded into visual and textual vectors. Then, a multi-modal transformation model is used to map these visual and textual vectors into a common multi-modal space where a similarity function is applied.

5.1 Visual Encoding

We represent each painting image as a visual vector, \(\varvec{i}_k\), using convolutional neural networks (CNNs). We consider different visual encoders: VGG16 [24], several versions of ResNet [8] and the RMAC descriptor [26].

 

VGG16 :

[24] contains 13 \(3\times 3\) convolutional layers and three fully-connected layers stacked on top of each other. We use the output of one of the fully connected layers as the visual encoding.

ResNet :

[8] uses shortcut connections to connect the input of a layer to the output of a deeper layer. There exist many versions depending on the number of layers, such as ResNet50 and ResNet152 with 50 and 152 layers, respectively. We use the output of the last layer as the visual encoding.

RMAC :

is a visual descriptor introduced by Tolias et al. in [26] for image retrieval. The activation map from the last convolutional layer of a CNN is max-pooled over several regions to obtain a set of regional features. The regional features are post-processed, summed together and normalized to obtain the final visual representation (a simplified sketch is given after this list).

 
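The sketch below approximates the R-MAC pooling step: it uses a non-overlapping grid of regions instead of the overlapping regions of [26] and omits the PCA-whitening of regional features, so it should be read as an approximation rather than a faithful reimplementation.

```python
import torch
import torch.nn.functional as F

def rmac(feature_map, levels=(1, 2, 3)):
    # feature_map: (C, H, W) activation map from the last convolutional layer,
    # assuming H and W are at least max(levels). Max-pool over a grid of regions
    # at several scales, l2-normalize each regional vector, sum and l2-normalize.
    C, H, W = feature_map.shape
    regions = []
    for l in levels:
        hs, ws = H // l, W // l
        for i in range(l):
            for j in range(l):
                r = feature_map[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
                regions.append(F.normalize(r.amax(dim=(1, 2)), dim=0))
    return F.normalize(torch.stack(regions).sum(dim=0), dim=0)
```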

5.2 Textual Encoding

With respect to the textual information, comments are encoded into a comment vector, \(\varvec{c}_k\), and attributes are encoded into an attribute vector, \(\varvec{a}_k\). To get the joint textual encoding, \(\varvec{t}_k\), both vectors are concatenated.

Comment Encoding. To encode comments into a comment vector, \(\varvec{c}_k\), we first build a comment vocabulary, \(V_C\). \(V_C\) contains all the alphabetic words that appear at least ten times in the training set. The comment vector is obtained using three different techniques: a comment bag-of-words (BOWc), a comment multi-layer perceptron (MLPc) and a comment recurrent model (LSTMc).

 

BOW c :

each comment is encoded as a term frequency-inverse document frequency (tf-idf) vector, weighting each word in the comment by its relevance within the corpus (a minimal sketch is given after this list).

MLP c :

comments are encoded as tf-idf vectors and fed into a fully connected layer with tanh activation and \(\ell _2\)-normalization. The output of the normalization layer is used as the comment encoding.

LSTM c :

each sentence in a comment is encoded into a sentence vector using a 2,400-dimensional pre-trained skip-thought model [13]. The sentence vectors are input into a long short-term memory network (LSTM) [9]. The last state of the LSTM is \(\ell _2\)-normalized and used as the comment encoding.

 
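A minimal sketch of the BOWc encoder, assuming scikit-learn and a simple whitespace tokenization (the exact tokenization and weighting details may differ from those used to build the dataset baselines):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def build_bowc_encoder(train_comments, min_count=10):
    # Vocabulary V_C: alphabetic words appearing at least min_count times
    # in the training comments; comments are then encoded as tf-idf vectors.
    counts = Counter(w for doc in train_comments
                     for w in doc.lower().split() if w.isalpha())
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    encoder = TfidfVectorizer(vocabulary=vocab, lowercase=True)
    encoder.fit(train_comments)
    return encoder

# usage: c_k = build_bowc_encoder(train_comments).transform([comment]).toarray()[0]
```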

Attribute Encoding. We use the attribute field Title in the metadata to provide extra textual information to our model. We propose three different techniques to encode titles into attribute encodings, \(\varvec{a}_k\): an attribute bag-of-words (BOWa), an attribute multi-layer perceptron (MLPa) and an attribute recurrent model (LSTMa).

 

BOW a :

as in comments, titles are encoded as tf-idf-weighted vectors using a title vocabulary, \(V_T\). \(V_T\) is built with all the alphabetic words in the titles of the training set.

MLP a :

also as in comments, tf-idf encoded titles are fed into a fully connected layer with tanh activation and \(\ell _2\)-normalization. The output of the normalization layer is used as the attribute vector.

LSTM a :

in this case, each word in a title is fed into an embedding layer followed by an LSTM network. The output of the last state of the LSTM is \(\ell _2\)-normalized and used as the attribute encoding (see the sketch after this list).

 
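A PyTorch-style sketch of the LSTMa encoder is given below; the 300-dimensional hidden state and the title vocabulary size follow Sect. 6, while the embedding dimensionality and the framework choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TitleLSTMEncoder(nn.Module):
    # LSTMa sketch: embed title words, run an LSTM, l2-normalize the last hidden state.
    def __init__(self, vocab_size=9092, embed_dim=300, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, title_ids):                  # title_ids: (batch, num_words) word indices
        _, (h_n, _) = self.lstm(self.embed(title_ids))
        return F.normalize(h_n[-1], p=2, dim=-1)   # attribute encoding a_k, (batch, hidden_dim)
```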

5.3 Multi-modal Transformation

The visual and textual encodings, \(\varvec{i}_k\) and \(\varvec{t}_k\) respectively, encode visual and textual data into two different spaces. We use a multi-modal transformation model to map the visual and textual representations into a common multi-modal space. In this common space, textual and visual information can be compared in terms of the similarity function d. We propose three different models, which are illustrated in Fig. 3.

Fig. 3. Multi-modal transformation models. Models for mapping textual and visual representations into a common multi-modal space.

 

CCA :

Canonical Correlation Analysis (CCA) [7] is a linear approach for projecting data from two different sources into a common space by maximizing the normalized correlation between the projected data. The CCA projection matrices are learnt by using training pairs of samples from the corpus. At test time, the textual and visual encodings from a test sample are projected using these CCA matrices.

CML :

Cosine Margin Loss (CML) is a deep learning architecture trained end-to-end to learn the visual and textual encodings and their projections all at once. Each image encoding is fed into a fully connected layer followed by a tanh activation function and a \(\ell _2\)-normalization layer to project the visual feature, \(\varvec{i}_j\), into a D-dimensional space, obtaining the projected visual vector \(\varvec{p}^{vis}_j\). Similarly, each textual vector, \(\varvec{t}_k\), is input into another network with an identical layer structure (fully connected layer with tanh activation and \(\ell _2\)-normalization) to map the textual feature into the same D-dimensional space, obtaining the projected textual vector \(\varvec{p}^{text}_k\). We train the CML model with both positive (\(k = j\)) and negative (\(k \ne j\)) pairs of textual and visual data, using cosine similarity with margin as the loss function:

$$\begin{aligned} \begin{aligned}L_{CML}(\varvec{p}^{vis}_k, \varvec{p}^{text}_j) = {\left\{ \begin{array}{ll} 1 - \cos (\varvec{p}^{vis}_k, \varvec{p}^{text}_j), &{} \text {if } k = j \\ \max (0, \cos (\varvec{p}^{vis}_k, \varvec{p}^{text}_j) - m), &{} \text {if } k \ne j \end{array}\right. }\end{aligned} \end{aligned}$$
(12)

where \(\text {cos}\) is the cosine similarity between two normalized vectors and m is the margin hyperparameter.

AMD :

Augmented Metadata (AMD) is a model in which the network is informed with attribute data to provide an extra alignment between the visual and the textual encodings. The AMD model consists of a deep learning architecture that projects both visual and textual vectors into the common multi-modal space while, at the same time, ensuring that the projected encodings are meaningful in the art domain. As in the CML model, image and textual encodings are projected into D-dimensional vectors using fully connected layers, and the loss between the multi-modal transformations is computed using a cosine margin loss. Attribute metadata is used to train a pair of classifiers on top of the projected data (Fig. 3, AMD Model), each classifier consisting of a fully connected layer without activation (a combined sketch of the CML and AMD objectives is given after this list). The metadata classifiers are trained using a standard cross-entropy classification loss:

$$\begin{aligned} L_{META}(\varvec{x}, class) = -\log \left( \frac{\exp (\varvec{x}[class])}{\sum _j \exp (\varvec{x}[j])}\right) \end{aligned}$$
(13)

which contributes to the total loss of the model in addition to the cosine margin loss. The total loss of the model is then computed as:

$$\begin{aligned} \begin{aligned} L_{AMD}(\varvec{p}^{text}_k, \varvec{p}^{vis}_j,l_{p^{text}_k}, l_{p^{vis}_j}) = (1 - 2\alpha ) L_{CML}(\varvec{p}^{text}_k, \varvec{p}^{vis}_j)\\ +\ \alpha L_{META}(\varvec{p}^{text}_k, l_{p^{text}_k}) \\ +\ \alpha L_{META}(\varvec{p}^{vis}_j, l_{p^{vis}_j}) \end{aligned} \end{aligned}$$
(14)

where \(l_{p^{text}_k}\) and \(l_{p^{vis}_j}\) are the class labels of the k-th text and the j-th image, respectively, and \(\alpha \) is the weight of the classifier loss.

 
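The following PyTorch-style sketch puts the CML and AMD objectives together: a projection head per modality (fully connected layer, tanh, \(\ell _2\)-normalization), the cosine margin loss of Eq. (12) and the weighted combination of Eq. (14). The framework choice and tensor shapes are assumptions; the hyperparameter values follow Sect. 6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    # Fully connected layer + tanh + l2-normalization, one head per modality.
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(torch.tanh(self.fc(x)), p=2, dim=-1)

def cosine_margin_loss(p_vis, p_text, positive, m=0.1):
    # Eq. (12): positive pairs are pulled together, negatives pushed below margin m.
    cos = F.cosine_similarity(p_vis, p_text, dim=-1)
    return torch.where(positive, 1.0 - cos, torch.clamp(cos - m, min=0.0)).mean()

def amd_loss(p_text, p_vis, positive, text_labels, vis_labels,
             text_classifier, vis_classifier, alpha=0.01, m=0.1):
    # Eq. (14): cosine margin loss plus the two metadata classification losses.
    l_cml = cosine_margin_loss(p_vis, p_text, positive, m)
    l_meta_text = F.cross_entropy(text_classifier(p_text), text_labels)  # Eq. (13), text side
    l_meta_vis = F.cross_entropy(vis_classifier(p_vis), vis_labels)      # Eq. (13), visual side
    return (1 - 2 * alpha) * l_cml + alpha * l_meta_text + alpha * l_meta_vis
```

Here `text_classifier` and `vis_classifier` would be plain `nn.Linear(128, num_classes)` layers, matching the fully connected layers without activation described above; `num_classes` depends on the attribute chosen to inform the model.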

6 Experiments

Experimental Details. In the image encoding part, each network is initialized with its standard pre-trained weights for image classification. Images are scaled down to 256 pixels per side and randomly cropped into \(224 \times 224\) patches. Visual data is augmented by randomly flipping images horizontally. In the textual encoding part, the dimensionality of the LSTM hidden state is 1,024 for comments and 300 for titles. The title vocabulary size is 9,092 and the skip-thought dimensionality is 2,400. In the multi-modal transformation part, the CCA matrices are learnt using scikit-learn [20]. For the deep learning architectures, we use the Adam optimizer with a learning rate of 0.0001, m set to 0.1 and \(\alpha \) set to 0.01. Training is conducted in mini-batches of 32 samples. Cosine similarity is used as the similarity function d in all of our models.
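For the CCA baseline, a minimal scikit-learn sketch might look as follows; the encodings are replaced by random stand-ins and the dimensions are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Random stand-ins for textual (tf-idf based) and visual (CNN) training encodings.
T_train, I_train = np.random.rand(1000, 3000), np.random.rand(1000, 2048)
T_test, I_test = np.random.rand(100, 3000), np.random.rand(100, 2048)

cca = CCA(n_components=128, max_iter=500)      # 128-dimensional common space
cca.fit(T_train, I_train)                      # learn the projection matrices
P_text, P_vis = cca.transform(T_test, I_test)  # project test encodings
```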

Text2Art Challenge Evaluation. Painting images are ranked according to their similarity to a given text, and vice versa. The ranking is computed over the whole set of test samples and results are reported as median rank (MR) and recall rate at K (R@K), with K being 1, 5 and 10. MR is the median position of the relevant item in the ranking over all test samples, so lower is better. R@K is the fraction of samples for which the relevant item is ranked within the top K positions, so higher is better.
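For reference, a small sketch computing these metrics from the 1-based rank of the relevant item for each test query:

```python
import numpy as np

def median_rank_and_recall(ranks, ks=(1, 5, 10)):
    # ranks: 1-based position of the relevant item for each test query.
    ranks = np.asarray(ranks)
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return float(np.median(ranks)), recalls

# e.g. median_rank_and_recall([1, 3, 12, 250]) -> (7.5, {1: 0.25, 5: 0.5, 10: 0.5})
```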

6.1 Visual Domain Adaptation

We first evaluate the transferability of visual features from the natural image domain to the artistic domain. In this experiment, texts are encoded with the BOWc approach with \(|V_C| =\) 3,000. As the multi-modal transformation model, a 128-dimensional CCA is used. We extract visual encodings from networks pre-trained for classification of natural images, without further fine-tuning or refinement. For the VGG16 model, we extract features from the first, second and third fully connected layers (VGG16FC1, VGG16FC2 and VGG16FC3). For the ResNet models, we consider the visual features from the output of the networks (ResNet50 and ResNet152). Finally, RMAC representations are computed using a VGG16, a ResNet50 and a ResNet152 (RMACVGG16, RMACRes50 and RMACRes152). Results are detailed in Table 2. As semantic art understanding is a high-level task, it is expected that representations from deeper layers perform better, as observed in the VGG16 models, where the deepest layer of the network obtains the best performance. RMAC features transfer well from natural images to art, although the ResNet models obtain the best performance overall. Considering these results, we use ResNets as visual encoders in the following experiments.
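A sketch of this feature-extraction setup using torchvision (an assumption; the framework is not prescribed by the method) with an ImageNet-pretrained ResNet50 used as a frozen encoder, its classification layer replaced by the identity, and a center crop used here for simplicity:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# torchvision >= 0.13 weights API; the final fc layer is dropped to keep
# the 2048-d pooled feature as the visual encoding.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

painting = Image.open("painting.jpg").convert("RGB")    # any painting image from the dataset
with torch.no_grad():
    i_k = resnet(preprocess(painting).unsqueeze(0))[0]  # 2048-d visual encoding
```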

Table 2. Visual Domain Adaptation. Transferability of visual features from the natural image classification domain to the Text2Art challenge.
Table 3. Text Encoding in Art. Comparison between different text encodings in the Text2Art challenge.

6.2 Text Encoding in Art

We then compare the performance of the different text encoding models in the Text2Art challenge. In this experiment, images are encoded with a ResNet50 network and the CML model is used to learn the mapping of the visual and the textual encodings into a common 128-dimensional space. The different encoding methods are compared in Table 3. The best performance is obtained when using the simplest bag-of-words approach for both comments and titles (BOWc and BOWa), although the multi-layer perceptron models (MLPc and MLPa) obtain similar results. Models based on recurrent networks (LSTMc and LSTMa) are not able to capture the insights required for semantic art understanding. These results are consistent with previous work [27], which shows that recurrent text models perform worse than non-recurrent methods for multi-modal tasks that do not require text generation.

Fig. 4. Qualitative positive results. For each text (i.e. title and comment), the top five ranked images, along with their score, are shown. The ground truth image is highlighted in green. (Color figure online)

6.3 Multi-modal Models for Art Understanding

Finally, we compare the three proposed multi-modal transformation models in the Text2Art challenge: CCA, CML and AMD. For the AMD approach, we use four different attributes to inform the model: Type (AMDT), Timeframe (AMDTF), School (AMDS) and Author (AMDA). ResNet50 is used to encode visual features. Results are shown in Table 4. Random ranking results are provided as a reference. Overall, the best performance is achieved with the CML model and bag-of-words encodings. CCA achieves the worst results among all the models, which suggests that linear transformations are not able to adjust properly to the task. Surprisingly, adding extra information in the AMD models does not lead to further improvement over the CML approach. We suspect that this might be due to the unbalanced number of samples within the classes of the dataset. Qualitative results of the CML model with ResNet50 and bag-of-words encodings are shown in Figs. 4 and 5. In the positive examples (Fig. 4), not only is the ground truth painting ranked within the top five returned images, but all the images within the top five are also semantically similar to the query text. In the unsuccessful examples (Fig. 5), although the ground truth image is not ranked in the top positions of the list, the algorithm returns images that are semantically related to fragments of the text, which indicates how challenging the task is.

Fig. 5. Qualitative negative result. For each text, the ground truth image is shown next to it, along with its ranking position and score. Below, the five top ranked images.

Table 4. Multi-modal transformation models. Comparison between different multi-modal transformation models in the Text2Art challenge.
Table 5. Human Evaluation. Evaluation in both the easy and the difficult sets.

6.4 Human Evaluation

We design a task on Amazon Mechanical Turk for testing human performance on the Text2Art challenge. For a given artistic text, which includes comment, title, author, type, school and timeframe, human evaluators are asked to choose the most appropriate painting from a pool of ten images. The task has two different levels: easy, in which the pool of images is chosen randomly from all the paintings in the test set, and difficult, in which the ten images in the pool share the same attribute Type (i.e. portraits, landscapes, etc.). For each level, evaluators are asked to perform the task on 100 artistic texts. Accuracy is measured as the ratio of correct answers over the total number of answers. Results are shown in Table 5. Although human accuracy is considerably high, reaching 88.9% in the easy set, there is a drop in performance at the difficult level, mostly because images of the same type are described by more similar comments than images of different types. We evaluate a CCA and a CML model on the same data split as humans. The CML model with bag-of-words and ResNet50 is able to find the relevant image for 75% of the samples in the easy set and for 62% of the cases in the difficult one. There is a gap of around ten points between the CML model and human evaluation, which suggests that, although there is still room for improvement, meaningful art representations are being obtained.

7 Conclusions

We presented the SemArt dataset, the first collection of fine-art images with attributes and artistic comments for semantic art understanding. In SemArt, comments describe artistic information about the painting, such as content, technique or context. We designed the Text2Art challenge to evaluate semantic art understanding as a multi-modal retrieval task, whereby given an artistic text (or image), a relevant image (or text) is found. We proposed several models to address the challenge. We showed that for visual encoding, ResNets perform best. For textual encoding, recurrent models performed worse than multi-layer perceptrons or bag-of-words. We projected the visual and textual encodings into a common multi-modal space using several methods, the best results being obtained by a neural network trained with a cosine margin loss. Experiments with human evaluators showed that current approaches are not yet able to reach human levels of art understanding, although meaningful representations for semantic art understanding are being learnt.