1 Introduction

Several recent works show that recycling analysis tools developed for natural images (photographs) can yield surprisingly good results for analysing paintings or drawings. In particular, impressive classification results have been obtained on painting databases by using convolutional neural networks (CNNs) designed for the classification of photographs [10, 55]. These results arise in a general context where methods of transfer learning [14] (changing the task a model was trained for) and domain adaptation (changing the nature of the data a model was trained on) are increasingly applied. Classifying and analysing paintings is of course of great interest to art historians, and can help them take full advantage of the massive artwork databases that are being built worldwide.

More difficult than classification, and at the core of many recent computer vision works, the object detection task (classifying and localising an object) has been less studied in the case of paintings, although exciting results have been obtained, again using transfer techniques [11, 28, 52].

Methods that detect objects in photographs have been developed thanks to massive image databases on which several classes (such as cats, people, cars) have been manually localised with bounding boxes. The PASCAL VOC [17] and MS COCO [34] datasets have been crucial to the development of detection methods, and the recently introduced Google Open Images Dataset (2M images, 15M boxes for 600 classes) is expected to push the limits of detection further. For now, there is no such database (with localised objects) in the field of Art History, even though large databases are being built by many institutions and academic research teams, e.g. [16, 38, 39, 43, 44, 53]. Some of these databases include image-level annotations, but none includes location annotations. Besides, manually annotating such large databases is tedious and must be redone each time a new category is searched for. It is therefore of great interest to develop weakly supervised detection methods that can learn to detect objects using image-level annotations only. While this aspect has been thoroughly studied for natural images, only a few studies have been dedicated to the case of paintings or drawings.

Moreover, these studies are mostly dedicated to the cross-depiction problem: they learn to detect the same objects in photographs and in paintings, in particular man-made objects (cars, bottles, ...) or animals. While these may be useful to art historians, there is a clear need to detect more specific objects or attributes, such as ruins or nudity, and characters of iconographic interest, such as Mary, Jesus as a child or the crucifixion of Jesus. These last categories can hardly be directly inherited from photographic databases.

For these two reasons, the lack of location annotations and the specificity of the categories of interest, a general method allowing weakly supervised detection on specific domains such as paintings would be of great interest to art historians, and more generally to anyone needing automatic tools to explore artistic databases. We propose several contributions in this direction:

  • We introduce a new multiple-instance learning (MIL) technique that is simple and fast enough to deal with large databases.

  • We demonstrate the utility of the proposed technique for object detection on weakly annotated databases, including photographs, drawings and paintings. These experiments are performed using image-level annotations only.

  • We propose the first experiments dealing with the recognition and detection of iconographic elements that are specific to Art History, exhibiting both successful detections and some classes that are particularly challenging, especially in a weakly supervised context.

We believe that such a system, enabling one to detect new and unseen categories with minimal supervision, is of great benefit for dealing efficiently with digital artwork databases. More precisely, iconographic detection results are useful for several particularly active domains of the humanities: Art History (to gather data on the iconography of recurrent characters, such as the Virgin Mary or Saint Sebastian, and to study the formal evolution of their representations), Semiology (to infer mutual configurations or relative dimensions of iconographic elements), History of Ideas and Cultures (with categories such as nudity or ruins), Material Culture Studies, etc.

In particular, being able to detect iconographic elements is of great importance for the study of spatial configurations, which are central to the reading of images and particularly timely given the increasing importance of Semiology. To fix ideas, we give two examples of potential use. First, the order in which iconographic elements are encountered (e.g. Gabriel and Mary) when reading an image from left to right has received much attention from art historians [20]. In the same spirit, recent studies [5] on the meaning of mirror images in early modern Italy could benefit from the detection of iconographic elements.

2 Related Work

Object Recognition and Detection in Artworks. Early works on cross-domain (or cross-depiction) image comparison were mostly concerned with sketch retrieval, see e.g. [12]. Various local descriptors were then used for comparing and classifying images, such as part-based models [46] or mid-level discriminative patches [2, 9]. In order to enhance the generalisation capacity of these approaches, it was proposed in [54] to model objects through graphs of labels. More generally, it was shown in [25] that structured models are more likely to succeed in cross-domain recognition than appearance-based models.

Next, several works have sought to transfer the tremendous classification capacity of convolutional neural networks to cross-domain object recognition, in particular for paintings. In [10], it is shown that recycling CNNs directly for the task of recognising objects in paintings, without fine-tuning, yields surprisingly good results. Similar conclusions were drawn in [55] for artistic drawings. In [32], a robust low-rank parametrised CNN model is proposed to recognise common categories in an unseen domain (photo, painting, cartoon or sketch). In [53], a new annotated database is introduced, on which it is shown that fine-tuning improves recognition performance. Several works have also successfully adapted CNN architectures to the problem of style recognition in artworks [3, 31, 36]. More generally, the use of CNNs opens the way to other artwork analysis tasks, such as visual link retrieval [45], scene classification [19], author classification [51] or possibly generic artwork content representation [48].

The problem of object detection in paintings, that is, both localising and recognising objects, has been less studied. In [11], it is shown that applying a pre-trained object detector (Faster R-CNN [42]) and then selecting the localisation with highest confidence can yield correct detections of PASCAL VOC classes. Other works have attacked this difficult problem by restricting it to a single class. In [22], it is shown that the deformable part model outperforms other approaches, including some CNNs, for the detection of people in cubist artworks. In [40], it is shown that the YOLO network trained on natural images can, to some extent, be used for people detection in cubism. In [52], it is proposed to perform people detection in a wide variety of artworks (through a newly introduced database) by fine-tuning a network in a supervised way. People can be detected with high accuracy even though the database has very large stylistic variations and includes paintings that strongly differ from photographs in the way they represent people.

Weakly supervised detection refers to the task of learning an object detector using limited annotations, usually image-level annotations only. Often, a set of candidate detections (e.g. bounding boxes) is considered for each image, assuming we only know whether at least one of these detections corresponds to the category of interest. The corresponding statistical problem is referred to as multiple-instance learning (MIL) [13]. A well-known solution to this problem, through a generalisation of the Support Vector Machine (SVM), has been proposed in [1]. Several approximations of the involved non-convex problem have been proposed, see e.g. [21] or the recent survey [6].

Recently, this problem has been attacked using classification and detection neural networks. In [47], it is proposed to learn a smooth version of an SVM on features from R-CNN [23], focusing on the initialisation phase, which is crucial due to the non-convexity of the problem. In [41], it is proposed to learn to detect new specific classes by taking advantage of the knowledge of wider classes. In [4], a weakly supervised deep detection network is proposed, based on Fast R-CNN [24]; this work was improved in [50] by adding a multi-stage classifier refinement. In [8], a multi-fold split of the training data is proposed to escape local optima. In [33], a two-step strategy is proposed: first collecting good regions by a mask-out classification, then selecting the best positive region in each image by a MIL formulation, and finally fine-tuning a detector with those proposals as "ground truth" bounding boxes. In [15], a new pooling strategy is proposed to efficiently learn the localisation of objects without performing bounding box regression.

Weakly supervised strategies for the cross-domain problem have been much less studied. In [11], a relatively basic methodology is proposed, in which for each image the bounding box with the highest (class-agnostic) "objectness" score is classified. In [28], it is proposed to perform mixed supervised object detection with cross-domain learning, based on the SSD network [35]: object detectors are learnt using instance-level annotations on photographs and image-level annotations on a target domain (watercolor, cartoon, etc.). We compare our approach with these two methods in Sect. 4.

3 Weakly Supervised Detection by Transfer Learning

In this section, we present our approach to the weakly supervised detection of visual categories in paintings. In order to perform transfer learning, we first apply Faster R-CNN [42] (a detection network trained on photographs), which is used as a feature extractor, in the same way as in [11]. This results in a set of candidate bounding boxes. For a given visual category, the goal is then, using image-level annotations only, to decide which boxes correspond to this category. For this, we propose a new multiple-instance learning method, detailed in Sect. 3.1. In contrast with classical approaches to the MIL problem such as [1], the proposed heuristic is very fast. This, combined with the fact that we do not need fine-tuning, permits flexible, on-the-fly learning of new categories in a few minutes.

Figure 1 illustrates the situation we face at training time. For each image, we are given a set of bounding boxes, together with an image-level label: +1 (the visual category of interest is present at least once) or −1 (the category is not present in this image).

Fig. 1. Illustration of positive and negative sets of detections (bounding boxes) for the angel category.

3.1 Multiple Instance Learning

The usual way to perform MIL is through the resolution of a non-convex energy minimisation problem [1], although efficient convex relaxations have been proposed [29]. One disadvantage of these approaches is their heavy computational cost. In what follows, we propose a simple and fast heuristic for this problem.

For simplicity of the presentation, we assume only one visual category. Assume we have N images at hand, each of which contains K bounding boxes. Each image receives a label \(y = +1\) when it is a positive example (the category is present) and \(y = -1\) otherwise. We denote by \(n_1\) the number of positive examples in the training set, and by \(n_{-1}\) the number of negative examples.

Images are indexed by i, the K regions provided by the object detector are indexed by k, the label of the i-th image is denoted by \(y_i\) and the high level semantic feature vector of size M associated to the k-th box in the i-th image is denoted \(X_{i,k}\). We also assume that the detector provides a (class agnostic) “objectness” score for this box, denoted \(s_{i,k}\).

We make the (strong) hypothesis that if \(y_{i} = +1\), then at least one of the K regions in image i contains an occurrence of the category. In a sense, we assume that the region proposal part is robust enough to transfer detections from photographs to the target domain.

Following this assumption, our problem boils down to the classic multiple-instance classification problem [13]: if for image i we have \(y_i=+1\), then at least one of the boxes contains the category, whereas if \(y_i=-1\), no box does. The goal is then to decide which boxes correspond to the category. Instead of the classical SVM generalisation proposed in [1], which relies on an iterative procedure, we look for a hyperplane minimising the functional defined below. That is, we look for \(w \in \mathbf {R}^M\), \(b\in \mathbf {R}\) achieving

$$\begin{aligned} \min _{(w,b)} \mathcal {L} (w,b) \end{aligned}$$
(1)

with

$$\begin{aligned} \phi (w,b) = \sum _{i=1}^{N} \frac{-y_{i}}{n_{y_i}} \tanh \left\{ \max _{ k \in \{1..K \} } \left( w^{T} X_{i,k} + b \right) \right\} \end{aligned}$$
(2)

and

$$\begin{aligned} \mathcal {L} (w,b) = \phi (w,b) + C \, \Vert w \Vert ^2, \end{aligned}$$
(3)

where C is a constant balancing the regularisation term. The intuition behind this formulation is that minimising \(\mathcal {L} (w,b)\) amounts to seeking a hyperplane separating the most positive element of each positive image from the least negative element of each negative image, sharing similar ideas with MI-SVM [1] and Latent-SVM [18]. The \(\tanh \) is here to mimic the SVM formulation, in which only the worst margins count. We divide by \(n_{y_i}\) to account for unbalanced data; indeed, most example images are negative ones (\(n_{-1}\gg n_{1}\)).

The main advantage of this formulation is that it can be minimised by a simple gradient descent, therefore avoiding costly repeated SVM optimisations. If the dataset is too big to fit in memory, we switch to stochastic gradient descent by considering random batches of the training set.

As this problem is non-convex, we try several random initialisations and select the couple \((w,b)\) minimising the classification term \(\phi (w,b)\). Although we did not explore this possibility, it may be interesting to keep more than one vector to describe a class, since one iconographic element could have more than one specific feature, each stemming from a distinctive part.
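
To make the optimisation concrete, the following sketch implements Eqs. (1)-(3) with plain gradient descent and random restarts in NumPy. It is a minimal illustration under our stated hyper-parameters, not our actual GPU-batched implementation; the function and variable names are ours, and the objectness-weighted variant of Eq. (4) would simply multiply each box score by \((s_{i,k} + \epsilon )\) before the max.

```python
import numpy as np

def mi_max_loss_and_grad(w, b, X, y, C):
    """MI-max objective (Eqs. (2)-(3)) and its gradient.
    X: (N, K, M) features of the K boxes of each of the N images;
    y: (N,) image labels in {+1, -1}; C: regularisation weight."""
    N = X.shape[0]
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    coef = -y / np.where(y == 1, n_pos, n_neg)    # -y_i / n_{y_i}

    scores = X @ w + b                             # (N, K) box scores
    k_star = np.argmax(scores, axis=1)             # best box of each image
    t = np.tanh(scores[np.arange(N), k_star])      # tanh of the max

    loss = np.sum(coef * t) + C * np.dot(w, w)
    # Only the argmax box contributes to the (sub)gradient of the max
    g = coef * (1.0 - t ** 2)                      # chain rule on tanh
    grad_w = (g[:, None] * X[np.arange(N), k_star]).sum(axis=0) + 2 * C * w
    grad_b = g.sum()
    return loss, grad_w, grad_b

def train_mi_max(X, y, C=1e-3, lr=0.01, iters=300, restarts=12, seed=0):
    """Gradient descent with random restarts; keeps the couple (w, b)
    minimising the classification term phi (i.e. the loss with C = 0)."""
    rng = np.random.default_rng(seed)
    M = X.shape[2]
    best = None
    for _ in range(restarts):
        w, b = rng.normal(scale=0.01, size=M), 0.0
        for _ in range(iters):
            _, gw, gb = mi_max_loss_and_grad(w, b, X, y, C)
            w -= lr * gw
            b -= lr * gb
        phi = mi_max_loss_and_grad(w, b, X, y, 0.0)[0]
        if best is None or phi < best[0]:
            best = (phi, w, b)
    return best[1], best[2]
```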

In practice, we observed consistently better results when slightly modifying the above formulation to take into account the (class-agnostic) "objectness" score associated with each box (as returned by Faster R-CNN). We therefore modify the function \(\phi \) into

$$\begin{aligned} \phi ^s(w,b) = \sum _{i=1}^{N} \frac{-y_{i}}{n_{y_i}} \tanh \left\{ \max _{k \in \{1..K\} } \left( \left( s_{i,k} + \epsilon \right) \left( w^{T} X_{i,k} + b \right) \right) \right\} \end{aligned}$$
(4)

with \( \epsilon \ge 0\). The motivation behind this formulation is that the score \(s_{i,k}\), roughly the probability that an object (of any category) is present in box k, provides a prioritisation between boxes.

Once the best couple (\(w^{\star },b^{\star }\)) has been found, we compute the following score, reflecting the meaningfulness of category association:

$$\begin{aligned} S(x) = \tanh \left\{ \left( s(x) + \epsilon \right) \left( w^{\star T} x + b^{\star } \right) \right\} \end{aligned}$$
(5)

At test time, each box with a positive score (5) (where s(x) is the objectness score associated with x) is assigned to the category. The approach is straightforwardly extended to an arbitrary number of categories, by computing a couple \((w^{\star },b^{\star })\) per category. Observe, however, that this leads to non-comparable scores between categories. Among all boxes assigned to each class, a non-maximum suppression (NMS) algorithm is then applied in order to avoid redundant detections. The resulting multiple-instance learning method is called MI-max.
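
As an illustration, the test-time scoring of Eq. (5) can be sketched as follows, with the same array shapes as in the training sketch above:

```python
import numpy as np

def score_boxes(X_img, s_img, w_star, b_star, eps=0.01):
    """Eq. (5): category score for each of the K boxes of one test image.
    X_img: (K, M) box features; s_img: (K,) objectness scores."""
    return np.tanh((s_img + eps) * (X_img @ w_star + b_star))

# Boxes with a positive score are assigned to the category; the kept
# detections are then filtered by NMS (thresholds given in Sect. 4).
```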

3.2 Implementation Details

Faster R-CNN. We use the detection network Faster R-CNN [42]. We only keep its region proposal part (RPN) and the features corresponding to each proposed region. In order to allow an efficient and flexible learning of new classes, we choose to avoid retraining or even fine-tuning. Faster R-CNN is a meta-network in which a pre-trained network is enclosed; the quality of the features depends on the enclosed network, and we compare several possibilities in the experimental part.

Images are resized to 600 by 1000 before applying Faster R-CNN. We only keep the 300 boxes with the best "objectness" scores (after an NMS phase), along with their high-level features. An example of extracted boxes is shown in Fig. 2. About 5 images per second can be processed on a standard GPU. This part can be performed offline, since we do not fine-tune the network.
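
Schematically, this offline stage amounts to the following; `detector` and its `propose` method are hypothetical placeholders for the Faster R-CNN interface (in practice, the implementation of Chen et al. [7] mentioned below):

```python
import numpy as np

def extract_features(detector, images, n_boxes=300):
    """Offline extraction step. `detector` is a hypothetical wrapper
    around a pre-trained Faster R-CNN, exposing its RPN proposals,
    pooled high-level features and objectness scores after NMS."""
    all_X, all_s = [], []
    for img in images:
        X, s = detector.propose(img, top_k=n_boxes)  # (300, M), (300,)
        all_X.append(X)
        all_s.append(s)
    # Features are cached once; the MIL training never touches the
    # network again, which is what makes adding new classes cheap.
    return np.stack(all_X), np.stack(all_s)
```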

As mentioned in [30], residual networks (ResNet) appear to be the best ImageNet architecture for transfer learning by feature extraction, and we therefore choose these networks for our Faster R-CNN versions. One of them (denoted RES-101-VOC07) is a 101-layer ResNet trained for the detection task on PASCAL VOC2007. The other one (denoted RES-152-COCO) is a 152-layer ResNet trained on MS COCO [34]. We also compare our approach to the plain application of these networks for the detection task when possible, that is, when they were trained on the classes we want to detect. We refer to these approaches as FSD (fully supervised detection) in our experiments.

For the implementation, we build on the TensorFlow implementation of Faster R-CNN by Chen et al. [7].

Fig. 2. Some of the regions of interest generated by the region proposal part (RPN) of Faster R-CNN.

MI-max. When a new class is to be learned, the user provides a set of weakly annotated images. The MI-max framework described above is then run to find a linear separator specific to the class. Note that both the database and the library of classifiers can be enriched very easily: adding an image to the database only requires running it through the Faster R-CNN network, and adding a new class only requires a MIL training.

For training the MI-max, we use a batch size of 1000 examples (for smaller sets, all features are loaded into the GPU), 300 iterations of gradient descent with a learning rate of 0.01, and \(\epsilon =0.01\) in (4). The whole process takes 750 s for the 20 classes of PASCAL VOC07 trainval (5011 images) with 12 random starting points per class, on a consumer GPU (GTX 1080Ti). The random restarts are performed in parallel to take advantage of the presence of the features in GPU memory, since the transfer of data from central RAM to GPU memory is a bottleneck for our method. The 20 classes can also be learned in parallel.

For the experiments of Sect. 4.3, we also perform a grid search on the hyper-parameter C of (3), by splitting the training set into training and validation sets. We learn several couples \((w,b)\) for each possible value of C (with different initialisations), and the one that minimises the loss (4) is selected for each class. We refer to this variant as MI-max-C.
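
A minimal sketch of this selection step, reusing `train_mi_max` and `mi_max_loss_and_grad` from the sketch in Sect. 3.1 (the validation-based selection rule and the grid values are our illustrative reading of the procedure above):

```python
def mi_max_grid_search(X_tr, y_tr, X_val, y_val,
                       C_grid=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Pick the couple (w, b) whose classification term is lowest
    on the held-out validation split."""
    best = None
    for C in C_grid:
        w, b = train_mi_max(X_tr, y_tr, C=C)
        phi_val = mi_max_loss_and_grad(w, b, X_val, y_val, 0.0)[0]
        if best is None or phi_val < best[0]:
            best = (phi_val, w, b)
    return best[1], best[2]
```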

4 Experiments

In this section, we perform weakly supervised detection experiments on different databases, in order to illustrate different assets of our approach.

In all cases, besides other comparisons, we compare our approach (MI-max) to the following baseline, which is actually the approach chosen for the detection experiments in [11] (except that we do not perform box expansion): the region with the best "objectness" score is assumed to correspond to the label associated with the image (positive or negative). This baseline is denoted MAX. Linear SVM classifiers are learnt on these features, one per class, in a one-vs-the-rest manner. For each class, the weight parameter that produces the highest AP (average precision) score is selected by cross-validation, and a classifier is then retrained with the best hyper-parameter on all the training data. This baseline requires training several SVMs and is therefore costly.
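
For illustration, the MAX baseline for one class might look as follows; this is a sketch assuming scikit-learn, and the exact cross-validation protocol of the original experiments may differ:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def max_baseline(X, s, y, Cs=(0.01, 0.1, 1.0, 10.0)):
    """MAX baseline: the box with the best objectness score represents
    the image; a linear SVM is then trained on these features.
    X: (N, K, M) box features; s: (N, K) objectness scores; y: (N,)."""
    k_star = np.argmax(s, axis=1)                 # best box per image
    X_max = X[np.arange(len(X)), k_star]          # (N, M)
    # Select C by cross-validated AP, then retrain on all the data
    best_C = max(Cs, key=lambda C: cross_val_score(
        LinearSVC(C=C), X_max, y, scoring='average_precision').mean())
    return LinearSVC(C=best_C).fit(X_max, y)
```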

At test time, the labels and the bounding boxes are used to evaluate the performance of the methods in terms of AP per class. The generated boxes are filtered by NMS with an Intersection over Union (IoU) [17] threshold of 0.3 and a confidence threshold of 0.05 for all methods.
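
For reference, a standard greedy NMS with these thresholds can be sketched as follows (a generic implementation, not the exact code used in the experiments):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.3, score_thr=0.05):
    """Greedy non-maximum suppression.
    boxes: (K, 4) as [x1, y1, x2, y2]; scores: (K,)."""
    keep_mask = scores >= score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)       # indices by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the current best box with the remaining ones
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * \
                 (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop boxes overlapping too much
    return boxes[keep], scores[keep]
```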

4.1 Experiments on PASCAL VOC

Before proceeding with transfer learning and testing our method on paintings, we start with a sanity-check experiment on PASCAL VOC2007 [17]. We compare our weakly supervised approach, MI-max, to the plain application of the fully supervised Faster R-CNN [42] and to the weakly supervised MAX procedure recalled above. We perform the comparison using two different architectures (for the three methods), RES-101-VOC07 and RES-152-COCO, as explained in the previous section.

Table 1. VOC 2007 test, average precision (%). Comparison of the Faster R-CNN detector (trained in a fully supervised manner: FSD) and our MI-max algorithm (trained in a weakly supervised manner), for the two networks RES-101-VOC07 and RES-152-COCO.

As shown in Table 1, our weakly supervised approach (only using annotations at the image level) yields performances only slightly below those of the fully supervised approach (which uses instance-level annotations). On average, the loss is only 1.1% of mAP when using RES-152-COCO (for both methods). The baseline MAX procedure (used for transfer learning on paintings in [11]) yields notably inferior performances.

4.2 Detection Evaluation on Watercolor2k and People-Art Databases

We compare our approach with two recent methods performing object detection in artworks: one fully supervised [52], for detecting people; the other (partly) weakly supervised [28], for detecting several VOC classes in watercolor images. For the learning stage, the first approach uses instance-level annotations on paintings, while the second uses instance-level annotations on photographs and image-level annotations on paintings. In both cases, we show that using image-level annotations only (our approach, MI-max) yields only a slight loss of performance.

Experiment 1: Watercolor2k. This database, introduced in [28] and available online, is a subset of watercolor artworks from the BAM! database [53] with instance-level annotations for 6 PASCAL VOC classes (bike, bird, dog, cat, car, person), designed to study cross-domain transfer learning. On this database, we compare our approach to the methods from [28] and [4], to the baseline MAX discussed above, and to the classical MIL approach MI-SVM [1] (using a maximum of 50 iterations and no restarts).

In [28], a style transfer transformation (CycleGAN [56]) is applied to natural images with instance-level annotations: the images are transferred to the new modality (i.e. watercolor) in order to fine-tune a detector pre-trained on natural images. This detector is used to predict the localisation of objects on watercolor images annotated at the image level, and is then fine-tuned on those images in a fully supervised manner. Bilen and Vedaldi [4] proposed a weakly supervised deep detection network (WSDDN), which transforms a pre-trained network by replacing its classification part with a two-stream network (a region proposal stream and a classification stream) combined with a weighted MIL pooling strategy.

Table 2. Watercolor2k (test set), average precision (%). Comparison of the proposed MI-max method to alternative approaches.

From Table 2, one can see that our approach clearly outperforms the other methods using image-level annotations only ([4], MAX, MI-SVM). We also observe only a minor degradation of average performance (48.9% versus 54.3%) with respect to the method of [28], which is retrained using style transfer and instance-level annotations on photographs.

Experiment 2: People-Art. This database, introduced in [52], is made of artistic images with bounding boxes for the single class person. It is particularly challenging because of its high variability in styles and depiction techniques. The method introduced in [52] yields excellent detection performances on this database, but requires instance-level annotations for training: the authors rely on Fast R-CNN [24], of which they keep only the first three layers, before retraining the remainder of the network using manual location annotations on their database.

In Table 3, one can see that our approach MI-max yields detection results that are very close to the fully supervised results of [52], despite a much lighter training procedure. In particular, as already explained, our procedure can be trained directly on large, globally annotated databases, for which manually entering instance-level annotations is tedious and time-consuming.

Table 3. People-Art (test set), average precision (%). Comparison of the proposed MI-max method to alternative approaches.

4.3 Detection on IconArt Database

In this last experimental section, we investigate the ability of our approach to learn and detect new classes that are specific to the analysis of artworks, some of which cannot be learnt from photographs. Typical examples include iconic characters in specific situations, such as Child Jesus, the crucifixion of Jesus or Saint Sebastian. Although there has been a recent effort by academia and museums to increase open-access databases of artworks [10, 16, 31, 36,37,38, 44, 48], these usually do not include systematic and reliable keywords. One exception is the database of the Rijksmuseum, with labels based on the IconClass classification system [27], but it is mostly composed of prints, photographs and drawings. Moreover, these databases do not include the localisation of objects or characters.

In order to study the ability of our (and other) systems to detect iconographic elements, we gathered 5955 painting images from Wikimedia Commons, ranging from the 11th to the 20th century, which are partially annotated by Wikidata contributors. We manually checked and completed image-level annotations for 7 classes. The dataset is split into training and test sets, as shown in Table 4. For a subset of the test set, and only for the purpose of performance evaluation, instance-level annotations have been added. The resulting database is called IconArt. Example images are shown in Fig. 3. To the best of our knowledge, the presented experiments are the first to investigate the ability of modern detection tools to classify and detect such iconographic elements in paintings; moreover, we investigate this in a weakly supervised manner.

Table 4. Statistics of the IconArt database
Fig. 3. Example images from the IconArt database: angel on the first line, Saint Sebastian on the second. They illustrate some of the challenges posed by this database: tiny objects, occlusions and large pose variability.

To fix ideas on the difficulty of dealing with iconographic elements, we start with a classification experiment. For this, we use the same classification approach as in [10], with InceptionResNetv2 [49] as a feature extractor. We also perform classification-by-detection experiments, using the previously described MAX approach (as in [11]) and our approach, MI-max. In both cases, for each class, the score at the image level is the highest confidence detection score for this class over all the regions of the image. Results are displayed in Table 5. First, we observe that classification results vary strongly depending on the class. Classes such as Child Jesus, Mary or crucifixion obtain relatively high classification scores. Others, such as Saint Sebastian, are only poorly classified, probably due to a limited quantity of examples and a high variability of poses, scales and depiction styles. We also observe that, as mentioned in [11], classification by detection can provide better scores than global classification, possibly because of small objects, such as angels in our case. These classification scores could probably be increased using multi-scale learning (as in [51]), augmentation schemes and an ensemble of networks [11].
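
The classification-by-detection rule itself is a one-liner; a sketch, reusing `score_boxes` from the sketch in Sect. 3.1 (one class, one image):

```python
def image_level_score(X_img, s_img, w_star, b_star):
    """Image-level class score = highest box score of Eq. (5)."""
    return score_boxes(X_img, s_img, w_star, b_star).max()
```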

Table 5. IconArt classification test set: classification average precision (%).
Table 6. IconArt detection test set: detection average precision (%). All methods are based on RES-152-COCO.

Next, we evaluate the detection performance of our method, first with a restrictive metric, AP per class with IoU \(\geqslant \) 0.5 (as in all previous detection experiments in this paper), then with a less restrictive metric, IoU \(\geqslant \) 0.1. Results are displayed in Table 6. The results of this very demanding experiment are a mixed bag. Some classes, such as crucifixion, and to a lesser extent nudity or Child Jesus, are correctly detected. Others, such as angel, ruins or Saint Sebastian, hardly reach 15% detection scores, even with the relaxed criterion IoU \(\geqslant \) 0.1. Beyond the relatively small number of examples and the very strong scale and pose variations, there are further reasons for this:

  • The high in-class depiction variability (for Saint Sebastian, for instance);

  • The many occlusions between several instances of the same class (angel);

  • The fact that some parts of an object can be more discriminative than the whole object (nudity).

Illustrations of successes and failures are displayed in Figs. 4 and 5, respectively. On the negative examples, one can see that a region larger than the element of interest is often selected, or that a whole group of instances is selected instead of a single one. Future work could focus on the use of several couples \((w,b)\) instead of one to prevent these problems.

Fig. 4. Successful examples using our MI-max-C detection scheme. We only show boxes whose scores are over 0.75.

Fig. 5. Failure examples using our MI-max-C detection scheme. We only show boxes whose scores are over 0.75.

5 Conclusion

The results of this paper confirm that transfer learning is of great interest for analysing artwork databases. This was previously shown for classification and fully supervised detection schemes, and was investigated here in the case of weakly supervised detection. We believe that this framework is particularly suited to the development of tools helping art historians, because it avoids tedious annotations and opens the way to learning on large datasets. We also show, in this context, experiments dealing with iconographic elements that are specific to Art History and cannot be learnt from natural images.

In future work, we plan to use localisation refinement methods, to further study how to avoid poor local optima in the optimisation procedure, to add contextual information for small objects, and possibly to fine-tune the network (as in [15]) to learn better features on artworks. Another exciting direction is to investigate the potential of weakly supervised learning on large databases with image-level annotations, such as those of the Rijksmuseum [44] or the French museum consortium [43].