1 Introduction

This work aims to represent and explore artistic attributes and their relationships in order to improve classification and retrieval of artworks in automatic art analysis. With the large-scale digitisation of art from collections all over the world, computer vision and machine learning have become important tools in the conservation and dissemination of cultural heritage. Some of the most promising work in this direction involves the automatic analysis of paintings, in which computer vision techniques are applied to study the content [12, 47] and the style [9, 45], or to classify the attributes [35, 37] of a specific piece of art.

Automatic analysis of art usually involves the extraction of visual features from digitised artworks by using either handcrafted [5, 28, 49] or deep learning techniques [27, 34, 35, 54]. Visual features, especially the ones extracted from convolutional neural networks (CNNs) [24, 30, 50], have been shown to be very powerful at capturing content [12] and style [9] from paintings, producing outstanding results, for example, in the field of style transfer [45]. However, art specialists rarely analyse artworks as independent and isolated creations, but commonly study paintings within their artistic, historical and social contexts, such as the author's influences or the connections between different schools, as illustrated in Fig. 1.

Fig. 1 Art as an element in a global context. In Guernica, Pablo Picasso, by means of his own style built upon many artistic influences, such as Cubism or African art, expressed his emotions against war inspired by its historical and political contexts. Image source: www.PabloPicasso.org

To analyse art from a global perspective, we propose to extract context-aware embeddings from paintings by considering both visual and contextual information. For the visual information, we use a standard convolutional neural network, which successfully encodes the content and the style of each sample. On the other hand, for the contextual information, we propose the use of ContextNets, which capture the relationships between the different artistic attributes that are present in the dataset. As context can be acquired from multiple sources, in this work we explore two modalities of ContextNets.

The first modality is based on multitask learning (MTL). We jointly solve several art-related tasks (e.g. author classification, type classification, etc.) and aggregate the losses of the individual tasks into a single loss. By optimising this aggregated loss, the model is forced to find common elements and capture relationships between the different artistic attributes. In this type of ContextNet, the context is captured from the visual information, as the only input provided to the system is the painting itself.

In the second modality, in contrast, we use a knowledge graph (KG) to learn the different relationships between artistic attributes. We create an art-specific KG by connecting a set of paintings with their art-related attributes. Then, node neighbourhoods and positions within the graph are encoded into a vector to represent context. Whereas the MTL model is able to capture relationships occurring at the visual level, the use of KGs offers a more flexible representation of arbitrary relationships, which might not be well structured and might be more difficult to detect when considering visual content only. In both cases, we incorporate the information obtained with the aforementioned models into the art analysis system.

The two proposed ContextNets are evaluated on the SemArt dataset [20] in four different art classification tasks and in two cross-modal retrieval tasks. We show that, although neither of the proposed modalities outperforms the other in all of the evaluated tasks, ContextNets consistently outperform methods based on visual embeddings only. Furthermore, our previous work on context-aware embeddings [18] is extended by exploring the representations obtained with our ContextNets and confirming the presence of specific stylistic aspects in the clusters of the high-dimensional embedding space.

1.1 Contributions

The contributions of this work can be summarised as follows:

  • We propose to use specific networks, different from standard visual representation networks, to capture artistic context in paintings.

  • We explore two different modalities of our proposed networks, one based on multitask learning and another one based on knowledge graphs.

  • We investigate the resulting context-aware embeddings with a visualisation tool, finding insights on how the different artistic attributes are clustered in different embedding spaces.

2 Related work

2.1 Automatic art analysis

In order to identify specific attributes in paintings, early work in automatic art analysis focused on representing the visual content of paintings by designing handcrafted feature extraction methods [5, 25, 28, 37, 49]. For example, [25] proposed to detect authors by analysing their brushwork using wavelet decompositions, [28, 49] combined colour, edge, or texture features for author, style, and school classification, and [5, 37] used SIFT features [33] to classify paintings into different attributes.

In recent years, deep visual features extracted from CNNs have been repeatedly shown to be very effective in many computer vision tasks, including automatic art analysis [2, 20, 27, 34, 35, 44, 53, 54]. At first, deep features were extracted from pre-trained networks and used off-the-shelf for automatic art classification [2, 27, 44]. Later, deep visual features were shown to obtain better results when fine-tuned using painting images [8, 35, 47, 53, 54]. Alternatively, [10, 11, 12] explored domain transfer for object and face recognition in paintings, whereas [20] introduced the use of joint visual and textual models to study paintings from a semantic perspective.

So far, most of the proposed methods in automatic art analysis have focused on representing the visual essence of an artwork by capturing style and/or content. However, the study of art is not only about the visual appearance of paintings, but also about their historical, social, and artistic contexts. In this work, we propose to consider both visual and contextual information in art by introducing ContextNet networks. Although the main focus of this work is on painting classification and retrieval, our findings can be easily applied to other artistic areas [39, 40].

2.2 Multitask learning

Multitask learning models [6] aim to solve multiple tasks jointly with the hope that the generated generic features are more powerful than task-specific representations. In deep learning approaches, MTL is commonly performed via hard or soft parameter sharing [42]. In hard parameter sharing [6, 48], all parameters except for those of the output layers are shared between the tasks, whereas in soft parameter sharing [32, 58], each task is defined by its own parameters, which are encouraged to remain similar via regularisation methods.

Following the success of MTL in many computer vision problems, such as object detection and recognition [3, 43], object tracking [60], facial landmark detection [61], or facial attribute classification [41], we propose a hard parameter sharing MTL approach for obtaining context-aware embeddings in the domain of art analysis. In our approach, by jointly learning related artistic tasks, the resulting visual representations are forced to capture relationships and common elements between the different artistic attributes, such as author, school, type, or period, thus providing contextual information about each painting. In parallel with our work, Strezoski et al. [52] also show outstanding improvements in an art classification dataset by using MTL strategies, which supports our claim that context is strongly beneficial in automatic art analysis.

2.3 Knowledge graphs

Knowledge graphs are complex graph structures able to capture non-structured relationships between the data represented in the graph. When KGs are used to add contextual information to a multimedia database, prior work has shown consistent improvements in annotation, classification, and retrieval benchmarks [7, 13, 15, 17, 26, 36, 43, 55, 59].

To extract contextual information from a KG, one strategy is to encode relationships from visual concepts detected in pictures, forming concept hierarchies [15, 43]. Johnson et al. [26] introduced human-generated scene graphs based on descriptions of pictures to improve retrieval tasks, whereas [13] exploited semantic relationships between labels using ConceptNet [51]. Another strategy is to gather labelling information from social media to compute a word-image graph, in which random walks are proposed to extract topological information [59]. Other approaches incorporate the use of external knowledge bases. For example, [17] proposed to improve classifiers with the use of WordNet, [36, 38] designed an end-to-end learning pipeline to incorporate large knowledge graphs, such as Visual Genome [29], into classification, and [55] trained image and graph embeddings using WordNet, NELL [4], or NEIL [7].

While related work mostly relies on the use of external knowledge, in our knowledge graph model, we propose to capture contextual information only by processing the data provided with art datasets. As the semantics of art pieces are extremely domain specific, the symbolism that is implied in mythological or religious representations may not benefit from general knowledge. Instead, we leverage metadata from art datasets to create a domain-specific knowledge graph, from which we train context embeddings without any task-specific supervision.

Fig. 2 Overview of the multitask learning ContextNet

3 Multitask learning ContextNet

In the MTL ContextNet, artistic context is obtained by finding visual relationships between common elements in different artistic attributes. To compute context-aware embeddings, the model is trained to learn multiple artistic tasks jointly, so the generated embeddings are forced to capture visual similarities between the different tasks.

Formally, in a multitask learning problem, given T learning tasks, with the training set for the tth task consisting of \(N_t\) training samples and denoted as \(\{\mathbf {x}^t_j,y^t_j\}^{N_t}_{j=1}\), where \(\mathbf {x}^t_j \in {\mathbb {R}}^d\) and \(y^t_j\) are the jth training sample and its label, respectively, the goal is to optimise:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\{\mathbf {w}^t\}^T_{t=1}} \sum ^T_{t=1} \sum ^{N_t}_{j=1} \lambda ^t \ell _t(f(\mathbf {x}_j^t;\mathbf {w}^t),y_j^t) \end{aligned}$$

where f is a function parameterised by the vector \(\mathbf {w}^t\), \(\ell _t\) is the loss function for the tth task, and \(\lambda ^t\), with \(\sum ^T_{t=1}\lambda ^t = 1\), weights the contribution of each task.

In our model, we aim to distinguish between the context-aware information and the task-specific data. We define the function parameters for the tth task as the contribution of two vectors, \(\mathbf {w}^t = [\mathbf {w}_g^t; \mathbf {w}_s^t]\), so that f is defined as:

$$\begin{aligned} f(\mathbf {x}_j^t;\mathbf {w}^t) = f_s(\mathbf {v}_j^t;\mathbf {w}_s^t) \end{aligned}$$


$$\begin{aligned} \mathbf {v}_j^t = f_g(\mathbf {x}_j^t;\mathbf {w}_g^t) \end{aligned}$$

where \(f_g\) is a context-aware function parameterised by \(\mathbf {w}_g^t\), \(f_s\) is a task-specific function parameterised by \(\mathbf {w}_s^t\), and \(\mathbf {v}_j^t\) is the jth context-aware embedding generated by task t.

By sharing both the training data and the context-aware parameters across all the tasks as \(\mathbf {x}_j^t = \mathbf {x}_j^k \text { and } \mathbf {w}_g^t = \mathbf {w}_g^k \text { for } t \ne k\), the context-aware embedding \(\mathbf {v}_j^t\) becomes independent of the task and is defined as:

$$\begin{aligned} \mathbf {v}_j = f_g(\mathbf {x}_j;\mathbf {w}_g) \end{aligned}$$

which forces \(\mathbf {v}_j\) to encode \(\mathbf {x}_j\) in a generic and non-task-specific representation by identifying patterns and relationships within different tasks. The problem, finally, is formulated as:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mathbf {w}_g,\{\mathbf {w}_s^t\}^T_{t=1}} \sum ^T_{t=1} \sum ^{N}_{j=1} \lambda ^t \ell _t(f_s(f_g(\mathbf {x}_j;\mathbf {w}_g);\mathbf {w}_s^t),y_j^t) \end{aligned}$$

To solve this optimisation problem, we propose the model in Fig. 2, in which the T learning tasks correspond to multiple artistic classification challenges, such as type, school, timeframe, or author classification. To obtain context-aware embeddings, the context-aware function, \(f_g\), is characterised by ResNet50 [24] after removing the last fully connected layer, whereas the task-specific functions, \(f_s\), are described by a fully connected layer followed by a ReLU nonlinearity. The output of \(f_g\) is a 2048-dimensional embedding, which is the input of the task-specific classifiers. Each classifier produces a \(C_t\)-dimensional task-specific embedding as output, \(\mathbf {z}_j^t\), where \(C_t\) is the number of classes in each task. Each task is formulated with the cross-entropy loss function as:

$$\begin{aligned} \ell _t(\mathbf {z}_j^t,y_j^t) = -\log \Bigg ( \frac{\exp (\mathbf {z}_j^t[y_j^t])}{\sum _c \exp (\mathbf {z}_j^t[c]) } \Bigg ) \end{aligned}$$

where \(\mathbf {z}_j^t = f_s(f_g(\mathbf {x}_j;\mathbf {w}_g);\mathbf {w}_s^t)\).
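As an illustration of the aggregated objective above, the following minimal sketch computes the weighted sum of per-task cross-entropy losses for a single painting. It is not the model used in this work: the shared function \(f_g\) is replaced by a toy linear layer with ReLU standing in for ResNet50, and the names mtl_loss, w_g, w_s, and lambdas are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def linear(x, weights):
    """Apply a bias-free linear map given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(v):
    return [max(0.0, a) for a in v]

def mtl_loss(x, labels, w_g, w_s, lambdas):
    """Aggregated loss for one sample x over all tasks.

    w_g: shared weights (toy stand-in for ResNet50, an assumption).
    w_s: dict task -> task-specific weights.
    lambdas: dict task -> loss weight, expected to sum to 1.
    """
    v = relu(linear(x, w_g))  # shared embedding v_j
    total = 0.0
    for task, head in w_s.items():
        z = relu(linear(v, head))  # task output z_j^t (FC + ReLU, as in the text)
        total += lambdas[task] * cross_entropy(z, labels[task])
    return total
```

Because every task head reads the same embedding v, the gradient of the aggregated loss with respect to w_g mixes the signals from all tasks, which is the hard parameter sharing described above.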

4 Knowledge graph ContextNet

In the MTL ContextNet, contextual information is provided by the painting images themselves by considering the relationships between common elements in the visual appearance of the images when multiple artistic tasks are trained together. In the knowledge graph ContextNet (KGM), in contrast, contextual information is obtained from capturing relationships in an artistic knowledge graph built with non-visual artistic metadata.

4.1 Artistic knowledge graph

A KG is a graph structure, \(G=(V,E)\), in which the entities and their relations are represented by a collection of nodes, V, and edges, E, respectively. We use a KG to capture contextual knowledge and similarities in the semantic space formed by the graph, often referred to as homophily [21].

To construct an artistic KG, one strategy is to connect paintings with edges when sharing a common attribute \(a \in A\), where A is a collection of artistic attributes. However, this approach is expensive, as the number of edges can reach the order of \(|V|^2 \times |A|\). Instead, we propose to connect paintings with their attributes in a much sparser manner. We consider multiple types of nodes: paintings, \(P \subseteq V\), which represent the paintings themselves (e.g. Girl with a Pearl Earring), and each family, \(\psi \), of attributes \(A_\psi \subseteq V\), which represent artistic concepts (e.g. a type such as Portrait or an author such as Van Gogh). We use the training data from the SemArt dataset [20], which contains 19,244 paintings labelled with the attributes Author, Title, Date, Technique, Type, School, and Timeframe to connect edges, \(e=(V_p,V_q) \in E\), between painting nodes, \(V_p\), and attribute nodes, \(V_q \in A_\psi \), with \(\psi \in \{{\textit{Type}, \textit{Timeframe}, \textit{Author}}\}\), when an attribute exists in a painting. As School corresponds to an author’s school, we connect an edge, \(e=(V_a,V_s) \in E\), between an author, \(V_a\), and a school, \(V_s\). We additionally enrich our graph with three other families of attributes, which are connected to painting nodes. From Technique, we extract Material, such as oil, and Support, such as \(210 \times 80\) cm. Also, by computing the most common n-grams in the titles, with n up to three, we extract keywords from the title of each painting, such as Three Graces. In total, the resulting KG presents 33,148 nodes and 125,506 edges, with 3166 authors, 618 materials, 26 schools, 8899 supports, 22 timeframes, 10 types, and 1163 keyword nodes. An example representation of our artistic graph is shown in Fig. 3.
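The sparse construction described above can be sketched as follows. This is only an illustrative approximation: the record keys, the function names, and the min_count threshold used to select keywords are assumptions (the text only states that the most common n-grams are used), and the Support family is omitted for brevity.

```python
from collections import Counter

def title_ngrams(title, max_n=3):
    """All word n-grams of a title, with n from 1 to max_n."""
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def build_graph(paintings, min_count=2):
    """Return (nodes, edges) of a painting-attribute graph.

    paintings: list of dicts with hypothetical keys
    'id', 'title', 'type', 'timeframe', 'author', 'school', 'material'.
    min_count: hypothetical threshold for keeping a title n-gram as keyword.
    """
    counts = Counter(g for p in paintings for g in set(title_ngrams(p["title"])))
    keywords = {g for g, c in counts.items() if c >= min_count}

    nodes, edges = set(), set()

    def connect(a, b):
        nodes.update((a, b))
        edges.add((a, b))

    for p in paintings:
        pid = ("painting", p["id"])
        for family in ("type", "timeframe", "author", "material"):
            connect(pid, (family, p[family]))
        # School is linked to the author node, not to the painting node.
        connect(("author", p["author"]), ("school", p["school"]))
        for g in set(title_ngrams(p["title"])):
            if g in keywords:
                connect(pid, ("keyword", g))
    return nodes, edges
```

Each painting contributes a number of edges proportional to its number of attributes, so the graph grows linearly with the number of paintings rather than quadratically.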

Fig. 3 An example of our artistic KG. Each node corresponds to either a painting or an artistic attribute, whereas edges correspond to existing interconnections

Fig. 4 Overview of the knowledge graph ContextNet

4.2 Training

At training time, visual and context embeddings are computed from the painting image and from the KG, respectively, and used to optimise the weights of the model. Our training model is depicted in Fig. 4, and each of its parts is detailed below.

Visual embeddings Visual embeddings represent the visual appearance of paintings, containing information about the content and the style of the artwork. To obtain the visual embeddings, we use a ResNet50 [24] without the last fully connected layer.

Context embeddings Context embeddings encode the artistic context of an artwork by extracting data from the KG. For encoding the KG information into a vector representation, we adopt the node2vec model [22] because of its capacity to preserve a trade-off between homophily and structural equivalence, resulting in high performance in node classification tasks [21]. To learn node embeddings, node2vec applies the skip-gram model over random walks in the KG and associates with each node a vector representing its neighbourhood and its overall position in the graph.
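To make the random-walk step more concrete, the sketch below generates one second-order biased walk in the style of node2vec on an unweighted graph. The return parameter p and the in-out parameter q are shown with placeholder values, which are assumptions and not the settings used in this work; the skip-gram training that turns walks into vectors is not shown.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One biased random walk of at most `length` nodes.

    adj: dict node -> list of neighbours (undirected, unweighted graph).
    p: return parameter, q: in-out parameter (placeholder values).
    rng: the random module or a random.Random instance.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = adj[cur]
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move away from prev
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk
```

Low values of q push walks outwards, which emphasises structural roles, whereas high values keep walks local, which emphasises homophily. Walks started from every node are then treated as sentences for the skip-gram model.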

Classifier The classifier takes as input the visual embedding and predicts the artistic attributes contained in the sample painting. We use different kinds of attribute classifiers, such as type, school, timeframe, or author. The classifier is composed of a fully connected layer followed by a ReLU nonlinearity, and its output is used to compute a classification loss using a cross-entropy loss function:

$$\begin{aligned} \ell _c(\mathbf {z}_j,class_j) = -\log \Bigg ( \frac{\exp (\mathbf {z}_j[class_j])}{\sum _i \exp (\mathbf {z}_j[i]) } \Bigg ) \end{aligned}$$

where \(\mathbf {z}_j\) and \(class_j\) are the output of the classifier and the assigned label of the attribute for the jth training painting, respectively.

Encoder The encoder module, which is composed of a single fully connected layer, is used to project the visual embeddings into the context embedding space. We compute the loss between the projected visual embedding, \(\mathbf {p}_j\), and the context embedding, \(\mathbf {u}_j\), of the jth training sample with a smooth L1 loss function:

$$\begin{aligned} \ell _e(\mathbf {p}_j,\mathbf {u}_j) = \frac{1}{n} \sum _i \delta _{ji} \end{aligned}$$


$$\begin{aligned} \delta _{ji} = {\left\{ \begin{array}{ll} \frac{1}{2}(p_{ji} - u_{ji})^2, &{} \quad \text {if}\ | p_{ji} - u_{ji} | \le 1 \\ | p_{ji} - u_{ji} | - \frac{1}{2}, &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(p_{ji}\) and \(u_{ji}\) are the ith elements in \(\mathbf {p}_j\) and \(\mathbf {u}_j\), respectively, and n is the dimensionality of the embeddings. To train the KGM, we compute the total loss function of the model as a combination of the losses obtained from the classifier and encoder modules:

$$\begin{aligned} {\mathcal {L}} = \lambda _c \sum _{j=1}^{N} \ell _c(\mathbf {z}_j,class_j) + \lambda _e \sum _{j=1}^{N} \ell _e(\mathbf {p}_j,\mathbf {u}_j) \end{aligned}$$

where \(\lambda _c\) and \(\lambda _e\) are parameters that weight the contribution of the classification and the encoder modules, respectively, and N is the number of training samples.
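The two loss terms can be sketched in a few lines. The code below is a plain-Python illustration rather than the training code of the model: the function names are hypothetical, the classification losses are assumed to be precomputed, and the default weights follow the values 0.9 and 0.1 reported in the implementation details.

```python
def smooth_l1(p, u):
    """Smooth L1 loss between a projected embedding p and a context embedding u."""
    n = len(p)
    total = 0.0
    for p_i, u_i in zip(p, u):
        d = abs(p_i - u_i)
        # Quadratic near zero, linear for large differences.
        total += 0.5 * d * d if d <= 1.0 else d - 0.5
    return total / n

def kgm_loss(class_losses, projected, context, lambda_c=0.9, lambda_e=0.1):
    """Total loss: weighted sum of classification and encoder losses.

    class_losses: per-sample cross-entropy values (assumed precomputed).
    projected, context: per-sample lists of embeddings p_j and u_j.
    """
    encoder = sum(smooth_l1(p, u) for p, u in zip(projected, context))
    return lambda_c * sum(class_losses) + lambda_e * encoder
```

For instance, for p = [0.5, 3.0] and u = [0.0, 0.0], the first element falls in the quadratic regime (0.125) and the second in the linear regime (2.5), giving a loss of 1.3125.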

Whereas the parameters of the context embeddings are learnt without supervision and frozen during the KGM training process, the loss score, \({\mathcal {L}}\), obtained from Equation (9) is backpropagated through the weights of the visual embedding module. This forces ResNet50 to compute embeddings that are meaningful for artistic classification by decreasing \(\ell _c\), while incorporating contextual information from the knowledge graph by minimising \(\ell _e\).

4.3 ContextNet at test time

At test time, to obtain context-aware embeddings from unseen test samples, painting images are fed into the fine-tuned ResNet50 model. As context embeddings computed directly from the KG cannot be obtained for samples that are not contained as a node, the context embedding and the encoder modules are removed from the test model (Fig. 4).

However, the ResNet50 network has been forced during the training process (1) to capture relevant visual information to predict artistic attributes and (2) to incorporate contextual data from the KG in the visual embeddings. Therefore, the output embeddings from the fine-tuned ResNet50 are, indeed, context-aware embeddings.

5 Art classification evaluation

We evaluated the two proposed ContextNets in multiple art classification tasks, including author identification and type classification.

5.1 Implementation details

In both of our proposed models, painting images are encoded into a vector representation by using ResNet50 [24] without the last fully connected layer. ResNet50 is initialised with its standard pre-trained weights for image classification, whereas the weights from the rest of the layers are initialised randomly. Images are scaled down to 256 pixels per side and randomly cropped into 224 \(\times \) 224 patches. At training time, visual data are augmented by randomly flipping images horizontally. The size of the embeddings produced by ResNet50 is 2048, whereas the dimensionality produced by node2vec is 128. We use stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.001 as optimiser. The training is conducted in mini-batches of 28 samples, with a patience of 30 epochs. In the MTL ContextNet, \(\lambda ^t\) is set to 0.25 for all the tasks, whereas in the KGM ContextNet, \(\lambda _c\) is set to 0.9 and \(\lambda _e\) to 0.1.

5.2 Evaluation dataset

We use the SemArt dataset [20] in our art classification evaluation. The SemArt dataset is a collection of 21,384 painting images, from which 19,244 are used for training, 1069 for validation, and 1069 for test. Each painting is associated with an artistic comment and the following attributes: Author, Title, Date, Technique, Type, School and Timeframe. We implement the following four tasks for art classification evaluation.

  • Type classification Using the attribute Type, each painting is classified according to 10 different common types of paintings: portrait, landscape, religious, study, genre, still life, mythological, interior, historical and other.

  • School classification The School attribute is used to assign each painting to one of the schools of art that appear in at least ten samples in the training set: Italian, Dutch, French, Flemish, German, Spanish, English, Netherlandish, Austrian, Hungarian, American, Danish, Swiss, Russian, Scottish, Greek, Catalan, Bohemian, Swedish, Irish, Norwegian, Polish and Other. Paintings with a school different from those are assigned to the class Unknown. In total, there are 25 school classes.

  • Timeframe classification The attribute Timeframe, which corresponds to periods of 50 years evenly distributed between 801 and 1900, is used to classify each painting according to its creation date. We only consider timeframes with at least ten paintings in the training set, obtaining a total of 18 classes, which includes an Unknown class for timeframes out of the selection.

  • Author identification The Author attribute is used to classify paintings according to 350 different painters. Although the SemArt dataset provides 3281 unique authors, we only consider the ones with at least ten paintings in the training set, including an Unknown class for painters not contained in the final selection.
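The class selection used in the School, Timeframe, and Author tasks, where only values with at least ten training paintings are kept and the remainder is mapped to Unknown, can be sketched as below. The function names and the alphabetical ordering of the class indices are illustrative assumptions.

```python
from collections import Counter

def build_label_map(train_labels, min_count=10, unknown="Unknown"):
    """Map each frequent label to a class index; reserve the last index for Unknown."""
    counts = Counter(train_labels)
    kept = sorted(label for label, c in counts.items()
                  if c >= min_count and label != unknown)
    return {label: i for i, label in enumerate(kept + [unknown])}

def encode(label, label_map, unknown="Unknown"):
    """Class index of a label, falling back to the Unknown class."""
    return label_map.get(label, label_map[unknown])
```

The map is built from the training labels only, so a label that appears exclusively in the validation or test set is also encoded as Unknown.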

5.3 Baselines

Our models are compared against the following baselines:

  • Pre-trained Networks VGG16 [50], ResNet50 [24] and ResNet152 [24] with their pre-trained weights learnt in natural image classification. To adapt the models for art classification, we modified the last fully connected layer to match the number of classes of each task. The weights of the last layer were initialised randomly and fine-tuned during training, whereas the weights of the rest of the network were frozen.

  • Fine-tuned Networks VGG16 [50], ResNet50 [24] and ResNet152 [24] networks were fine-tuned for each art classification task. As in the pre-trained models, the last layer was modified to match the number of classes in each task.

  • ResNet50+Attributes The output of each fine-tuned classification model from above was concatenated to the output of a pre-trained ResNet50 network without the last fully connected layer. The result was a high-dimensional embedding representing the visual content of the image and its attribute predictions. The high-dimensional embedding was input into a last fully connected layer with ReLU to predict the attribute of interest. Only the weights from the pre-trained ResNet50 and the last layer were fine-tuned, whereas the weights of the attribute classifiers were frozen.

  • ResNet50+Captions For each painting, we generated a caption using the captioning model from [57]. Captions were represented by a multi-hot vector with a vocabulary size of 5000 and encoded into a 512-dimensional embedding with a fully connected layer followed by a hyperbolic tangent (tanh) activation. The caption embeddings were then concatenated to the output of a ResNet50 network without the last fully connected layer. The concatenated vector was fed into a fully connected layer with ReLU to obtain the prediction.

Table 1 Art classification results on SemArt dataset

5.4 Results analysis

We measured classification performance in terms of accuracy, i.e. the ratio of correctly classified samples over the total number of samples. Results are provided in Table 1. In every task, the best accuracy was obtained when a ContextNet, MTL or KGM, was used. The MTL ContextNet performed slightly better than the KGM in School and Timeframe tasks, whereas the KGM was the best in classifying Type and Author attributes. Unsurprisingly, the pre-trained models obtained the worst results among all the baselines, as they do not present enough discriminative power in the domain of art. Also, there was a clear improvement with respect to pre-trained baselines when the networks were fine-tuned, as already noted in previous work [35, 47, 53, 54]. On the other hand, adding attributes or captions to the visual representations seemed to improve the accuracy, although not in all the scenarios, e.g. Timeframe was better classified with the fine-tuned ResNet50 model than with ResNet50+Attributes or ResNet50+Captions, whereas School was better classified with the fine-tuned ResNet50 than with ResNet50+Captions. This suggests that informing the model with extra information is generally, although not uniformly, beneficial. When the data used to inform the model were from a ContextNet, accuracy was boosted, with improvements ranging from 3.16 to 7.3% with respect to fine-tuned networks and from 1.32 to 5.5% with respect to ResNet50+Attributes and ResNet50+Captions.

Fig. 5 Examples of the SemArt dataset

6 Art retrieval evaluation

We additionally evaluated our ContextNets on art retrieval problems by incorporating context-aware embeddings into a cross-modal retrieval algorithm.

Fig. 6 ContextNets for cross-modal retrieval in art

6.1 Implementation details

As evaluation protocol, we used the SemArt dataset and its proposed Text2Art challenge, which consists of two cross-modal retrieval tasks: text-to-image and image-to-text. In text-to-image retrieval, given an artistic comment and its attributes, the goal is to find the correct painting within all the test paintings in the dataset. Similarly, in image-to-text retrieval, given a sample painting, the goal is to find the correct comment. Examples of paintings and their comments in the dataset can be seen in Fig. 5. We incorporate our ContextNets in a cross-modal retrieval system as shown in Fig. 6 and described below [19].

Visual encoder Painting images are scaled down to 256 pixels per side and randomly cropped into \(224 \times 224\) patches. Then, paintings are fed into ResNet50, initialised with its standard pre-trained weights, to obtain a 1000-dimensional vector, \(\mathbf {h}_{\text {cnn}}\), from the last convolutional layer. At the same time, paintings are fed into a ContextNet classifier to obtain a c-dimensional vector, \(\mathbf {h}_{\text {att}}\), containing the predicted attributes, with c being the number of output classes in the classifier. The final visual representation, \(\mathbf {h}\), is then computed as \(\mathbf {h} = \mathbf {h}_{\text {cnn}} \oplus \mathbf {h}_{\text {att}}\), where \(\oplus \) is concatenation.

Comment and attribute encoder We encode each comment as a term frequency–inverse document frequency (tf–idf) vector, \(\mathbf {q}_{\text {com}}\), using a vocabulary of size 9708, which is built with the alphabetic words that appear at least ten times in the training set. We encode titles as another tf–idf vector, \(\mathbf {q}_{\text {tit}}\), with a vocabulary of size 9092, which is built with the alphabetic words that appear in the titles of the training set. Additionally, we encode Type, School, Timeframe, or Author attributes using a c-dimensional one-hot vector, \(\mathbf {q}_{\text {att}}\), with c being the number of classes in each attribute. The final joint comment and attributes representation, \(\mathbf {q}\), is computed as \(\mathbf {q} = \mathbf {q}_{\text {com}} \oplus \mathbf {q}_{\text {tit}} \oplus \mathbf {q}_{\text {att}}\).

Cross-modal projections To compute similarities between cross-modal data, the visual representation, \(\mathbf {h}\), and the joint comment and attributes representation, \(\mathbf {q}\), are projected into a common 128-dimensional space using the nonlinear functions \(f_h\) and \(f_q\), respectively. The nonlinear functions are implemented with a fully connected layer followed by a tanh activation and an \(\ell _2\)-normalisation. Once projected into the common space, elements are retrieved according to their cosine similarity.

Table 2 Results on the Text2Art challenge

The weights of the retrieval model, except for those of the ContextNet, which is frozen, are trained using both positive (i.e. matching) and negative (i.e. non-matching) pairs of samples with the cosine margin loss function:

$$\begin{aligned} {\mathcal {L}}(\mathbf {h}_k, \mathbf {q}_j) = {\left\{ \begin{array}{ll} 1 - \text {sim}(f_h(\mathbf {h}_k), f_q(\mathbf {q}_j)), &{} \quad \text {if } k = j \\ \max (0, \text {sim}(f_h(\mathbf {h}_k), f_q(\mathbf {q}_j)) - \varDelta ), &{} \quad \text {if } k \ne j \end{array}\right. } \end{aligned}$$

where \(\text {sim}\) is the cosine similarity between two vectors and \(\varDelta = 0.1\) is the margin. We use the Adam optimiser with a learning rate of 0.0001.
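The cosine margin loss can be written directly from its two cases. The sketch below operates on already projected vectors, i.e. the outputs of \(f_h\) and \(f_q\); the function names are hypothetical and the projections themselves are not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cosine_margin_loss(h_proj, q_proj, matching, margin=0.1):
    """Loss for one (painting, text) pair in the common space.

    matching: True for a positive pair (k = j), False for a negative pair.
    """
    sim = cosine_similarity(h_proj, q_proj)
    if matching:
        return 1.0 - sim            # pull matching pairs together
    return max(0.0, sim - margin)   # penalise negatives only above the margin
```

Negative pairs whose similarity is already below the margin contribute no loss, so training concentrates on the non-matching pairs that are still too close to each other.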

6.2 Results analysis

Results are reported as median rank (MR) and recall rate at K (R@K), with K being 1, 5, and 10. MR is the median of the ranking positions of the relevant item over all samples, i.e. the lower the better, whereas R@K is the rate of samples for which the relevant item is in the top K positions of the ranking, i.e. the higher the better.
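Both metrics can be computed from a matrix of query-to-item similarities, as in the following sketch. It assumes that item i is the single correct match for query i and resolves ties in favour of the match; both conventions, as well as the function names, are assumptions made for illustration.

```python
from statistics import median

def rank_of_match(similarities, target):
    """1-based rank of the target item when sorted by decreasing similarity."""
    target_score = similarities[target]
    return 1 + sum(1 for s in similarities if s > target_score)

def retrieval_metrics(sim_matrix, ks=(1, 5, 10)):
    """Median rank and recall at K.

    sim_matrix[i][j]: similarity between query i and item j,
    where item i is assumed to be the correct match for query i.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(sim_matrix)]
    recalls = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return median(ranks), recalls
```

The same function serves both retrieval directions: for text-to-image the rows are comments and the columns are paintings, and for image-to-text the matrix is transposed.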

We report results of the proposed cross-modal retrieval model using the following ContextNets: MTL-Type, MTL-Timeframe, MTL-School, MTL-Author, KGM-Type, KGM-School, KGM-Timeframe, and KGM-Author, in which only the specified attribute is used. As a baseline of the proposed model, results when using fine-tuned ResNet152 instead of a ContextNet are also reported. Our methods are compared against previous work: CML [20], which encodes comments and titles without attribute information, and AMD [20], in which attributes are used at training time to learn the visual and textual projections. CML* is a reimplementation of CML with slightly better results.

Results are summarised in Table 2. The KGM-Author model obtained the best results, improving on the previous state of the art, CML*, by 37.24% on average. When comparing ContextNets, in agreement with the classification results (Table 1), MTL performed better than KGM when using School, whereas KGM was the best for the Type and Author attributes. We also noted that concatenating the output of an attribute classifier as proposed (ResNet152, MTL, and KGM models) improved results considerably with respect to AMD. However, we observed a large difference in performance across attributes, with Author and Type being the best and the worst, respectively. A possible explanation for this phenomenon may lie in the difference in the number of classes of each attribute.

Finally, our best model, KGM-Author, was further compared against human evaluators. In the easy set-up, evaluators were shown an artistic comment, a title, and the attributes Author, Type, School, and Timeframe and were asked to choose the most appropriate painting from a pool of ten random images. In the difficult set-up, however, instead of random paintings, the images shown shared the same Type attribute. Results are provided in Table 3. Our model reached values closer to human accuracy than previous work, outperforming CML by 10.67% in the easy task and by 9.67% in the difficult task.

Table 3 Comparison against human evaluation

7 Discussion and visualisation

To further understand the quality of our results, we investigate the ability of ContextNets to discern between different contextual cues. We additionally explore the generated embedding space using the knowledge graph as a visualisation tool.

Fig. 7
figure 7

Davies–Bouldin index for each different attribute. The blue and red groups correspond to single task of ResNet152-ft and KGM, respectively. Their best results are reported in both ResNet152-ft and KGM columns of the first group (colour figure online)

Fig. 8
figure 8

Embeddings of paintings projected in Tulip [1] using t-SNE [56]. Each node is a painting, and the colouring is mapped to the Timeframe attribute. Timeframe values are well separated in the node2vec and MTL embeddings, as opposed to ResNet152. Each red-circled area corresponds to the cluster selected for inspection in Fig. 10 (colour figure online)

7.1 Separability of embeddings

We study how well context is captured in different types of embeddings by analysing the separability of artistic attributes in clusters. To estimate the separability between clusters, we applied the Davies–Bouldin index [14], Q, which measures a trade-off between dispersion, \(S_i\), and separation, \(D_{ij}\), of the clusters i and j:

$$\begin{aligned} Q=\frac{1}{k}\sum _{i=1}^{k}\left( \max _{j \ne i} \left( \frac{S_i+S_j}{D_{ij}} \right) \right) \end{aligned}$$

where k is the number of clusters, and \(S_i\) and \(D_{ij}\) are computed as:

$$\begin{aligned} S_i = \left( \frac{1}{|C_i|} \sum _{\mathbf {x} \in C_i}^{}{\Vert \mathbf {x} - A_i\Vert ^p}\right) ^{1/p} \qquad D_{ij} = \Vert A_i-A_j\Vert _p \end{aligned}$$

where \(A_i\) is the centroid of cluster i, computed over its elements \(\mathbf {x} \in C_i\), distances are measured with the \(\ell _p\) norm, and \(|C_i|\) is the number of elements in \(C_i\).
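For concreteness, the index can be computed directly from these definitions. The following is a minimal sketch on toy 2D points, not the evaluation code used for Fig. 7; the function names are assumptions, and it requires at least two clusters with distinct centroids.

```python
def centroid(points):
    """Coordinate-wise mean A_i of a cluster."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def lp_distance(u, v, p=2):
    """l_p distance between two points."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def davies_bouldin(clusters, p=2):
    """Davies-Bouldin index Q; needs >= 2 clusters with distinct centroids."""
    centroids = [centroid(c) for c in clusters]
    # Dispersion S_i: p-th root of the mean p-th power distance to the centroid.
    dispersions = [
        (sum(lp_distance(x, a, p) ** p for x in c) / len(c)) ** (1.0 / p)
        for c, a in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case ratio of dispersion to separation D_ij over all other clusters j.
        total += max(
            (dispersions[i] + dispersions[j]) / lp_distance(centroids[i], centroids[j], p)
            for j in range(k)
            if j != i
        )
    return total / k


# Two compact, distant clusters versus two stretched, nearby clusters.
compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
stretched = [[(0, 0), (0, 4)], [(1, 0), (1, 4)]]
q_compact = davies_bouldin(compact)
q_stretched = davies_bouldin(stretched)
```

In this toy case the compact, well-separated clusters yield the smaller Q, which matches the reading that lower values indicate better separation.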

To compare the different settings, we used the samples from the training set and applied Q with \(p = 2\) to multiple types of embeddings on different attributes, as reported in Fig. 7. When compared on the same task, the smaller the value of Q, the better the cluster separation tends to be. We used the Type, School, Timeframe, and Author attributes to compare performance between models. We also included the derived Material and Support attributes, for which none of our models was fine-tuned. Along with Author, these new attributes have the highest dispersion due to their large number of classes, showing the lowest Q values.

Fig. 9
figure 9

Overview of the visualised knowledge graph

The compared embeddings are detailed in Fig. 7. The pre-trained ResNet152 baseline (in green) consistently shows the worst results in most categories, whereas the node2vec baseline trained on our KG (in orange) shows a good trade-off between categories and the best performance on the most complex attributes: Author, Material, and Support. On average, KGM (in purple) performs the best due to its high quality on each of the Type, School, and Timeframe attributes for which it has been trained. On average, MTL (in red) shows a performance comparable to that of the multiple single-task fine-tuned ResNet152 models (in blue).

Fig. 10
figure 10

Selected cluster for each type of embedding. Top. Each cluster has been enriched with a knowledge graph and redrawn accordingly. The colour encoding is the following: in dark green, time periods; in light green, type of paintings; in dark blue, author; in light blue, material; in dark orange, support; in light orange, school. Bottom. The list of painting thumbnails for each cluster (colour figure online)

These results support the claim that the contextual knowledge brought by the KG improves overall performance. We may further confirm this intuition from the 2D-projected embeddings in Fig. 8: while the space represented by the pre-trained ResNet152 applied to art does not show any convincing separability, the subspace formed by paintings in the node2vec embeddings shows clear separability and sub-densities. MTL also displays such a structure, although it is much more fragmented.

7.2 Knowledge graph visualisation

We further investigate the content of these clusters and how they capture abstract concepts of art by using the knowledge graph as a visualisation tool. An overview of the knowledge graph is given in Fig. 9.

We inspect one cluster (i.e. a density in the projected space) for each of the embeddings in Fig. 8. To identify such densities, we first apply DBSCAN [46] clustering to the 2D projections (see Footnote 1). We obtain 10 clusters for the pre-trained ResNet152, 106 clusters for node2vec, and 285 clusters for MTL. We further rank the top 10 clusters for each type of embedding based on the averaged pairwise Euclidean distance of their content, with a minimum size of 100 paintings per cluster. Then, we arbitrarily pick one cluster per type of embedding based on its size and visual appeal.
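The ranking step can be sketched as follows, assuming the clustering has already assigned each 2D point to a cluster label. The text does not state the sort direction, so ordering the most compact clusters first is an assumption of this sketch, as are the function names and the toy data (which use a minimum size of 3 rather than 100).

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs (needs >= 2 points)."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def rank_clusters(clusters, min_size=100, top=10):
    """Return up to `top` cluster labels, most compact first (assumed direction).

    clusters -- dict mapping a cluster label to its list of 2D points
    min_size -- clusters with fewer points are discarded (must be >= 2)
    """
    eligible = {label: pts for label, pts in clusters.items() if len(pts) >= min_size}
    ordered = sorted(eligible, key=lambda label: mean_pairwise_distance(eligible[label]))
    return ordered[:top]


# Toy clusters: "a" is compact, "b" is spread out, "c" is too small to qualify.
toy_clusters = {
    "a": [(0, 0), (0, 1), (1, 0)],
    "b": [(0, 0), (0, 10), (10, 0)],
    "c": [(0, 0), (1, 1)],
}
ranking = rank_clusters(toy_clusters, min_size=3)
```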

Table 4 Top-degree nodes for each type of embedding

To explore each cluster, we construct the knowledge subgraph induced by all the paintings contained in the selected cluster. To reduce the visual clutter, we remove all the knowledge graph nodes of degree 1 (see Footnote 2). In these mini knowledge graphs, the degree shows the influence of a node in the cluster. We thus map the degree of each node to its size, compute a force-directed layout [23], and then remove overlaps [16]. We further use edge bundling to remove the visual clutter induced by too many edges [31]. Results are shown in Fig. 10, using Tulip [1].
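The subgraph construction and pruning can be illustrated on a toy graph. This sketch assumes that the knowledge graph is given as (painting, attribute) edges, that pruning is a single pass, and that painting nodes are always kept; these choices, the function names, and the example nodes are all assumptions made for illustration, and the layout and edge-bundling steps performed in Tulip are not reproduced.

```python
from collections import defaultdict


def cluster_subgraph(edges, paintings):
    """Adjacency sets of the subgraph spanned by the cluster's paintings."""
    adjacency = defaultdict(set)
    for painting, attribute in edges:
        if painting in paintings:
            adjacency[painting].add(attribute)
            adjacency[attribute].add(painting)
    return dict(adjacency)


def prune_degree_one(adjacency, keep):
    """Single-pass removal of degree-1 nodes, except those listed in `keep`."""
    removed = {n for n, nbrs in adjacency.items() if len(nbrs) == 1 and n not in keep}
    return {n: nbrs - removed for n, nbrs in adjacency.items() if n not in removed}


def degrees(adjacency):
    """Node degree, used here as the measure of influence within the cluster."""
    return {n: len(nbrs) for n, nbrs in adjacency.items()}


# Hypothetical toy edges; "p3" is outside the selected cluster.
toy_edges = [
    ("p1", "oil"), ("p1", "Rembrandt"),
    ("p2", "oil"), ("p2", "Vermeer"),
    ("p3", "oil"),
]
selected = {"p1", "p2"}

subgraph = cluster_subgraph(toy_edges, selected)
pruned = prune_degree_one(subgraph, keep=selected)
node_degrees = degrees(pruned)
```

Attributes attached to a single painting disappear after pruning, so the nodes that remain are those shared across the cluster, and their degree can then drive the node size.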

For the selected clusters, we obtained 774 paintings, 261 authors, 83 supports, 14 materials, 11 schools, 9 timeframes, and 7 types in ResNet (Fig. 10d); 297 paintings, 74 authors, 18 supports, 9 materials, 8 schools, 1 timeframe, and 4 types in node2vec (Fig. 10e); and 174 paintings, 65 authors, 7 supports, 7 materials, 1 school, 1 timeframe, and 3 types in MTL (Fig. 10f).

The top-ranked nodes by degree are reported in Table 4. As we can see, the ResNet cluster concentrates still-life oil paintings, mostly from the seventeenth century and from many different authors, among which Dutch and Italian painters are well represented. The node2vec cluster focuses almost exclusively on portraits from the second half of the sixteenth century, mostly oil paintings, among which Italian and Flemish painters are well represented. The MTL cluster focuses almost exclusively on landscapes from the seventeenth century, mostly oil paintings, among which the Dutch masters are well represented. The characteristics of the painting type may be easily confirmed from the paintings in Fig. 10, which shows that both MTL- and node2vec-based embeddings capture well not only the timeframe but also more specific stylistic aspects of the dataset (i.e. in combination with type and school).

8 Conclusions

This work proposed the use of ContextNets to capture the relationships between artistic attributes in art classification and retrieval. Two modalities of ContextNet were introduced. The first one, based on multitask learning, captures the relationships between visual artistic elements in paintings, whereas the second one, based on knowledge graphs, encodes the interconnections between non-visual artistic attributes. The reported results showed that context-aware embeddings are beneficial in many automatic art analysis problems, improving art classification accuracy by up to 7.3% with respect to classification baselines. In cross-modal retrieval tasks, our best model outperformed previous work by 37.24%. We further investigated the clusters obtained from the context-aware embeddings, revealing that paintings with similar stylistic attributes were placed close to each other.