Studies involving deep learning approaches for pattern extraction and recognition in paintings and drawings can be broadly classified according to the tasks performed. These tasks have outlined the following main research trends and directions:
- Artwork attribute prediction;
- Information retrieval and artistic influence discovery;
- Object recognition and detection, including near duplicate detection;
- Content generation.
To a lesser extent, the following topics have also been addressed in the literature:
- Artistic to photo-realistic translation;
- Fake detection;
- Representativity;
- Emotion recognition and memorability estimation;
- Visual question answering;
- Artwork captioning.
Figure 3 shows the trend of these topics in terms of the number of reviewed articles published per year. It can be seen that since 2018 there has been an increasing number of publications on these topics, demonstrating the growing interest of the scientific community in digitized painting and drawing tasks. The following sections discuss each topic in detail.
Artwork attribute prediction
One of the tasks most frequently faced by researchers in the visual art domain is learning to recognize some artwork attributes (artist, genre, period, etc.) from their visual style. Automatic attribute prediction can support art experts in their work on painting analysis and in organizing large collections of paintings. Furthermore, the widespread diffusion of mobile technology has encouraged the tourism industry to develop applications that can automatically recognize the attributes of an artwork in order to provide visitors with relevant information [56, 57].
Although the concept of visual style is rather difficult to define rigorously, distinct styles are recognizable to human observers and are often evident in different painting schools. Artistic visual styles, such as Impressionism and Romanticism, are in fact characterized by distinctive features that allow artworks to be grouped according to related art movements. In other words, every artwork bears an "idiosyncratic signature" of visual style [58] which relates it to other similar works.
The papers investigating this topic can be categorized depending on whether they use a single model for each individual attribute or a multi-task model aimed at predicting different attributes simultaneously.
Single-task methods
Thanks to their ability to capture not only colour distribution, but also higher level features related to object categories, features automatically extracted by a CNN can easily surpass traditional hand-crafted features when tackling an artwork attribute prediction task. Indeed, one of the first works on this topic, by Karayev et al. [59], showed that a CNN pre-trained on PASCAL VOC [60], an object recognition and detection dataset, is quite effective in attributing the correct painting school to an artwork. The authors explained this behaviour by observing that object recognition depends on the appearance of the object, so the model learns to reuse these features for image style. In other words, they suggest that style is heavily content-dependent.
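As an illustration of this transfer-learning strategy, the following minimal PyTorch sketch reuses a CNN pre-trained for object recognition as a style classifier; the backbone, the number of style classes and the decision to freeze the convolutional base are assumptions made for illustration only, not the exact setup of [59].

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical number of painting schools/styles to predict
NUM_STYLES = 25

# Start from a CNN pre-trained for object recognition (ImageNet here,
# as a stand-in for the object recognition pre-training discussed above)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the object-category head with a style classification head
model.fc = nn.Linear(model.fc.in_features, NUM_STYLES)

# Freeze the convolutional base and train only the new head,
# i.e. reuse object-recognition features for style prediction
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, style_labels):
    """Standard training step on a batch of painting images and style labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), style_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```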
As mentioned above, another seminal work in this context is the research presented in [31], in which van Noord et al. proposed PigeoNET, a CNN trained on a large collection of paintings to perform automatic artist attribution based on visual characteristics. These characteristics can also be used to reveal the artist responsible for a specific area of an artwork in cases of multiple authorship of the same work. We observe that recognizing the unique characteristics of an artist is a complex task, even for an expert. This can be explained by considering that there can be low inter-variability among different artists and high intra-variability in the style of the same artist.
Saleh and Elgammal [61] developed a model capable of predicting not only style, but also genre and artist, based on a metric learning approach. The goal is to learn similarity measures optimized on the historical knowledge available for the specific domain. After learning the metric, the raw visual features are projected into a new optimized feature space on which standard classifiers are trained to solve the corresponding prediction task. In addition to classic visual descriptors, the authors also used features automatically learned by a CNN. Tan et al. [62] also focused on the three tasks of style, genre and artist classification, training on each task individually. Interestingly, they also visualized the neurons' responses in the genre classification task, highlighting how neurons in the first layer learn to recognize simple features, while, as layers go deeper, neurons learn to recognize more complex patterns, such as faces in portraits.
Cetinic et al. [63] conducted extensive experimentation to investigate the effective transferability of deep representations across different domains. Interestingly, one of their main findings is that fine-tuning networks pre-trained for scene recognition and sentiment prediction yields better performance in style classification than fine-tuning networks pre-trained for object recognition (typically on ImageNet). A similar investigation was recently conducted by Gonthier et al. [64]. The authors used visualization techniques to inspect the network's internal representations and provide clues about what a network learns from artistic images. Furthermore, they showed that a double fine-tuning involving a medium-sized artistic dataset can improve classification on smaller datasets, even when the task changes.
Chen et al. [65] further advanced research on the use of CNNs for style classification, starting from the observation that different layers in existing deep learning models have different feature responses for the same input image. To take full advantage of the information from different layers, the authors proposed an adaptive cross-layer model that combines responses from both lower and higher layers to capture style. Finally, another contribution was provided by Sandoval et al. [66], who proposed a two-stage image classification approach to improve style classification. In the first stage, the method splits the input image into patches and uses a CNN model to classify the artistic style of each patch. Then, the probability scores given by the CNN are combined into a single feature vector that is provided as input to a shallow neural network model that performs the final classification (see Fig. 4 and the sketch below). The main intuition of the proposed method is that individual patches work as independent evaluators of different portions of the same image; the final model ensembles these evaluations to make the final decision. As is usually the case in this research, confusion was found between historically similar styles. Hence, separating visual styles remains a challenging problem.
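A minimal sketch of this two-stage idea follows (PyTorch); the patch-splitting scheme, the tiny patch-level CNN and the fusion network sizes are illustrative assumptions, not the configuration used by Sandoval et al. [66].

```python
import torch
import torch.nn as nn

NUM_STYLES = 10        # assumed number of style classes
PATCHES_PER_IMAGE = 5  # assumed number of patches per painting

def split_into_patches(image, n=PATCHES_PER_IMAGE):
    """Split a (C, H, W) image tensor into n horizontal strips (one simple choice)."""
    return torch.chunk(image, n, dim=1)

# Stage 1: patch-level CNN acting as an independent style evaluator for each patch
patch_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, NUM_STYLES),
)

# Stage 2: shallow network fusing the per-patch probability scores
fusion_net = nn.Sequential(
    nn.Linear(NUM_STYLES * PATCHES_PER_IMAGE, 64), nn.ReLU(),
    nn.Linear(64, NUM_STYLES),
)

def predict_style(image):
    # Each patch is classified independently...
    patch_probs = [torch.softmax(patch_cnn(p.unsqueeze(0)), dim=1)
                   for p in split_into_patches(image)]
    # ...and the final model ensembles these evaluations into one decision
    fused = torch.cat(patch_probs, dim=1)
    return fusion_net(fused).argmax(dim=1)
```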
Multi-task methods
The methods described above address each prediction task individually. Tackling multiple tasks with a single end-to-end trainable model can improve training efficiency and classification performance when the representations of the same input required by the different tasks are correlated. A popular multi-task method is OmniArt [32]. Basically, it consists of a multi-output CNN model with a shared convolutional base for feature extraction and separate output layers, one for each task. The overall training is carried out by minimizing an aggregated loss obtained as a weighted combination of the separate losses, as sketched below.
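As a rough sketch of this kind of architecture, the following PyTorch code defines a shared convolutional base with one output head per attribute and an aggregated, weighted loss; the backbone, head sizes and loss weights are illustrative assumptions rather than the actual OmniArt configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskArtNet(nn.Module):
    """Shared convolutional base with one classification head per attribute."""

    def __init__(self, n_artists=100, n_genres=10, n_styles=25):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()          # keep only the shared feature extractor
        self.backbone = backbone
        self.artist_head = nn.Linear(feat_dim, n_artists)
        self.genre_head = nn.Linear(feat_dim, n_genres)
        self.style_head = nn.Linear(feat_dim, n_styles)

    def forward(self, x):
        h = self.backbone(x)
        return self.artist_head(h), self.genre_head(h), self.style_head(h)

criterion = nn.CrossEntropyLoss()
loss_weights = (1.0, 0.5, 0.5)  # illustrative weights for the aggregated loss

def multi_task_loss(outputs, targets):
    # Weighted combination of the separate per-task losses
    return sum(w * criterion(o, t)
               for w, o, t in zip(loss_weights, outputs, targets))
```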
A different approach was adopted by Belhi et al. [67], who presented a multi-modal architecture that simultaneously takes both digital images and textual metadata as input. The three-channel image is propagated through the convolutional base of a standard ResNet; the metadata, specifically information on genre, medium and style, are one-hot encoded and provided as input to a shallow feed-forward network. Higher level visual and textual features are concatenated and used to feed the final classification layer. Results indicate that the multi-modal classification system outperforms single-modality classification in most cases.
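The fusion scheme can be sketched as follows (PyTorch), assuming hypothetical metadata and class dimensions; this is a simplified illustration of the idea in [67], not its exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiModalClassifier(nn.Module):
    """Concatenate ResNet visual features with encoded metadata features."""

    def __init__(self, metadata_dim, n_classes, hidden=128):
        super().__init__()
        resnet = models.resnet50(weights=None)
        visual_dim = resnet.fc.in_features
        resnet.fc = nn.Identity()                  # convolutional base only
        self.visual_branch = resnet
        self.metadata_branch = nn.Sequential(      # shallow feed-forward network
            nn.Linear(metadata_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(visual_dim + hidden, n_classes)

    def forward(self, image, metadata_one_hot):
        v = self.visual_branch(image)
        t = self.metadata_branch(metadata_one_hot)
        return self.classifier(torch.cat([v, t], dim=1))

# Example: a one-hot encoded genre/medium/style vector of assumed length 40,
# used to predict, say, the artist among 50 candidates
model = MultiModalClassifier(metadata_dim=40, n_classes=50)
```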
Garcia et al. [35] went a step further by combining a multi-output model trained to solve attribute prediction tasks based on visual features with a second model based on non-visual information extracted from artistic metadata encoded in a knowledge graph (see Fig. 5). In short, a knowledge graph is a complex graph capable of capturing unstructured relationships between the data it represents. The second model, based on the constructed graph, is therefore intended to inject "context" information to improve the performance of the first model. To encode the knowledge graph information into a vector representation, the node2vec model [68] was adopted. Note that, at test time, context embeddings cannot be computed for samples that were not included as nodes of the knowledge graph, so the modules that process this information are discarded. However, the assumption is that the main classification model was forced to learn how to incorporate some contextual information during training. It is worth noting that the proposed method was successfully used by the authors to perform both classification and retrieval.
Information retrieval and artistic influence discovery
Another task that has attracted attention is finding similarity relationships between artworks of different artists and painting schools. These relationships can help art historians to discover and better understand the influences and changes from one artistic movement to another. Indeed, art experts rarely analyze artworks as isolated creations, but typically study paintings within broad contexts, involving influences and connections among different schools. Traditionally, this kind of analysis is done manually by inspecting large collections of human-annotated photos. However, manually searching through thousands of pictures, spanning different epochs and painting schools, is a very time-consuming and expensive process that an automatic support tool could avoid. More generally, studying how to automatically understand art is a step towards the long-term goal of providing machines with human aesthetic perception and the ability to semantically interpret images.
This task has been mainly addressed by employing a uni-modal retrieval approach based only on visual features. A different way to look at this problem is to use a multi-modal retrieval approach where computer vision and natural language processing converge towards a unified framework for pattern recognition. These aspects are treated separately in the following subsections.
Uni-modal retrieval
A uni-modal approach to finding similarities among paintings was proposed by Saleh et al. [69], based on traditional hand-crafted features. The authors trained discriminative and generative models for the supervised task of classifying painting style to ascertain which type of features would be most useful in the art domain. Then, once they found the most appropriate features, i.e. those that achieve the highest accuracy, they used these features to judge the similarity between paintings by using distance measures.
A method based on deep learning to retrieve common visual patterns shared among paintings was proposed by Seguin et al. [70]. The authors compared a classic bag-of-words method and a pre-trained CNN in predicting pairs of paintings that an expert considered to be visually related to each other, showing that the CNN-based method surpasses the more classic one. A supervised approach was used, in which the labels to be predicted were provided manually by human experts.
The manual annotation of images is a slow, error-prone and highly subjective process. Conversely, a completely unsupervised learning method would provide a useful alternative. Gultepe et al. [71] applied an unsupervised feature learning method based on k-means to extract features which were then fed into a spectral clustering algorithm for the purpose of grouping paintings. In line with these ideas, in [72, 73] we have proposed a method aimed at finding visual links among paintings in a completely unsupervised way. The method relies solely on visual attributes automatically learned by a deep pre-trained model, so it can be particularly effective when additional information, such as textual metadata, is scarce or unavailable. Furthermore, a computerized suggestion of influences between artists is obtained by exploiting the graph of painters derived from the visual links retrieved. The analysis of the network structure provides an interesting insight into the influences between artists, which can be considered the result of a historical knowledge discovery process (see Fig. 6).
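To give the flavour of such unsupervised pipelines, the sketch below extracts features from a pre-trained CNN (used as a fixed feature extractor, with no labels) and groups paintings with spectral clustering; the choice of backbone, affinity and number of clusters are assumptions for illustration, not the exact setups of [71] or [72, 73].

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.cluster import SpectralClustering

# Pre-trained CNN used purely as a fixed feature extractor (no labels needed)
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(batched_images):
    """batched_images: tensor of shape (N, 3, H, W), already preprocessed."""
    return backbone(batched_images).numpy()

def cluster_paintings(batched_images, n_clusters=10):
    features = extract_features(batched_images)
    clustering = SpectralClustering(n_clusters=n_clusters,
                                    affinity="nearest_neighbors")
    return clustering.fit_predict(features)

# Visual links between paintings can then be read off from cluster membership
# or from nearest neighbours in the feature space.
```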
In [36, 74], we moved further in this direction by exploiting a deep convolutional embedding framework for unsupervised painting clustering, where the task of mapping the raw input data to an abstract, latent space is jointly optimized with the task of finding a set of cluster centroids in that latent feature space (a sketch of this joint optimization is given below). We observed that when the granularity of clustering is coarse, the model takes into account more general features, mainly related to the artistic style. Conversely, when the granularity is finer, the model begins to use content features and tends to group works regardless of the corresponding style. This abstraction capability could be exploited to find similarities between artworks despite the way they are depicted.
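The joint optimization can be sketched in a DEC-like form as follows; the Student's t soft assignment, the sharpened target distribution and the loss weighting are generic choices typical of deep embedded clustering, not necessarily the exact formulation used in [36, 74].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepEmbeddedClustering(nn.Module):
    """Jointly learn a latent embedding and a set of cluster centroids."""

    def __init__(self, encoder, decoder, n_clusters, latent_dim):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))

    def soft_assign(self, z, alpha=1.0):
        # Student's t-kernel similarity between embeddings and centroids
        dist = torch.cdist(z, self.centroids) ** 2
        q = (1.0 + dist / alpha) ** (-(alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.soft_assign(z)

def joint_loss(x, x_rec, q, gamma=0.1):
    # Sharpened target distribution, treated as fixed during backpropagation
    p = (q ** 2) / q.sum(dim=0)
    p = (p / p.sum(dim=1, keepdim=True)).detach()
    reconstruction = F.mse_loss(x_rec, x)
    clustering = F.kl_div(q.log(), p, reduction="batchmean")
    return reconstruction + gamma * clustering

# Example instantiation with tiny MLP encoder/decoder on flattened inputs
enc = nn.Sequential(nn.Linear(784, 32))
dec = nn.Sequential(nn.Linear(32, 784))
model = DeepEmbeddedClustering(enc, dec, n_clusters=10, latent_dim=32)
```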
In general, most of the works in the visual art domain adopt a supervised learning approach which, despite yielding accurate results, requires labelled data to be available. Unsupervised learning has been studied less, and we believe it deserves further investigation as a viable alternative for extracting useful knowledge from visual data.
Multi-modal retrieval
The first corpus that provides not only painting images and their attributes, but also artistic comments intended to support semantic art understanding is the aforementioned SemArt dataset [13]. To evaluate and benchmark this task, Garcia and Vogiatzis designed the Text2Art challenge as a multi-modal retrieval task whose purpose is to assess whether a model is able to match a textual description to the correct painting, and vice versa. The authors proposed several models that basically share the same working scheme: first, images and descriptions with metadata attributes are encoded into visual and textual embeddings, respectively; then, a multi-modal transformation model maps these embeddings into a common feature space in which a similarity function is applied. Although experiments with human evaluators showed that the proposed approaches were unable to achieve art understanding at the human level, the models were able to learn representations that are meaningful for the task. Indeed, semantic art understanding appears to be a difficult task to solve.
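A generic version of this common working scheme is sketched below: visual and textual features are projected into a shared space and trained with a triplet ranking loss over cosine similarities; the feature dimensions, the choice of loss and its margin are assumptions, not the exact models benchmarked on Text2Art.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project visual and textual features into a common space for matching."""

    def __init__(self, visual_dim=2048, text_dim=768, joint_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, visual_feats, text_feats):
        v = F.normalize(self.visual_proj(visual_feats), dim=1)
        t = F.normalize(self.text_proj(text_feats), dim=1)
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    """Cosine-similarity matrix over a batch; matched pairs lie on the diagonal."""
    sim = v @ t.T
    pos = sim.diag().unsqueeze(1)
    cost_text = (margin + sim - pos).clamp(min=0)    # negatives: wrong captions
    cost_img = (margin + sim - pos.T).clamp(min=0)   # negatives: wrong images
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return (cost_text.masked_fill(mask, 0).mean()
            + cost_img.masked_fill(mask, 0).mean())
```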
It is worth noting that the same task was pursued by Garcia et al. [35] with the context-based method mentioned above for multi-task attribute prediction.
In a series of papers, Baraldi and colleagues studied methods for aligning text and illustrations in documents, i.e. understanding which parts of a plain text could be related to parts of the illustrations. In [75], in particular, the authors considered the problem of understanding whether a commentary written by an expert on a book, specifically a digitized version of the Borso d’Este Holy Bible, has some parts that refer to miniature illustrations. To tackle this challenging task, the authors proposed to create a shared embedding space, in which visual and textual features are projected and compared using a distance measure. The so-called BibleVSA dataset was proposed in this study.
In [14], the authors promoted research in this domain by extending the task of visual-semantic retrieval to a setting in which the textual domain does not contain exclusively visual sentences, i.e. those that describe the visual content of the work, but also contextual sentences, which describe the historical context of the artwork, its author, the place where the painting is located, and so on. To address this two-challenge task, the authors proposed the aforementioned Artpedia dataset. On this dataset, the authors experimented with a multi-modal retrieval model that jointly associates visual and textual elements, and discriminates between visual and contextual sentences of the same image.
Considering that artistic datasets are often smaller than traditional natural image datasets, the authors extended their previous work by moving to a semi-supervised paradigm, in which knowledge learned on ordinary visual-semantic datasets (source domains) is transferred to the artistic (target) domain [76]. As source domains, the authors used Flickr30k [77] and MS COCO [78], which are composed of natural images and are commonly used to train multi-modal retrieval methods. The aforementioned BibleVSA and SemArt datasets were instead used as target domains. Experiments validated the proposed approach and highlighted how the distributions of the target and source domains are significantly separated in the embedding space (Fig. 7). This emphasizes that artistic datasets define a completely new domain compared to ordinary datasets.
Object recognition and detection
Another task often faced by the research community working in this field is finding objects in artworks. Recognizing and detecting objects in artworks can help solve large-scale retrieval tasks as well as support the analyses made by historians. Indeed, art historians are often interested in finding out when a specific object first appeared in a painting or how the representation of an object evolved over time. A pioneering work in this context has been the research reported in [41] by Crowley and Zisserman. They proposed a system that, given an input query, retrieves positive training samples by crawling Google Images on-the-fly. These are then processed by a pre-trained CNN and used together with a pre-computed pool of negative features to learn a real-time classifier. Finally, the classifier returns a ranked list of paintings containing the queried object.
In this context, Cai et al. [42, 79] were among the first to emphasize the importance of addressing the cross-depiction problem in computer vision, i.e. recognizing and detecting objects regardless of whether they are photographed, painted, drawn, and so on. The variance between photos and artworks is greater than the variance within either domain alone, so classifiers trained on natural images may encounter difficulties with painting images due to this domain shift. Given the limitless range of potential depictions of the same object, the authors argue that a candidate solution is not to learn the specificity of each representation, but to learn the abstraction that different representations share, so that objects can be recognized independently of how they are depicted.
Crowley and Zisserman [81] improved their previous work by moving from image-level classifiers, i.e. those that take the overall image as input, to object detection systems, which are expected to better handle small objects, which are particularly prevalent in paintings. The results of their experimental study provided evidence that detectors can find many objects in paintings that would likely have been overlooked by object recognition methods. Westlake et al. [10] showed how a CNN trained only on photos can lead to overfitting; on the contrary, fine-tuning on artworks allows the model to generalize better to other artwork styles. To evaluate their proposal, the authors used the People-Art dataset, purposely created for the task of detecting people. To push research further in this direction, Wilber et al. [11] did not focus on people, but proposed the aforementioned BAM! dataset, designed to provide researchers with a large benchmark for extending the current state-of-the-art of computer vision to the visual art domain.
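Fine-tuning a detector pre-trained on photographs so that it adapts to artwork imagery can be sketched with the standard torchvision recipe below; the single "person" class mirrors the People-Art setting, but the detector choice and training details are illustrative assumptions, not the exact configuration used in [10].

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Detector pre-trained on natural photos (COCO); to be fine-tuned on paintings
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor head: background + person (People-Art-like setting)
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.005, momentum=0.9, weight_decay=0.0005)

def train_step(images, targets):
    """images: list of image tensors; targets: list of dicts with 'boxes' and 'labels'."""
    model.train()
    loss_dict = model(images, targets)   # detection losses returned in training mode
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```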
More recently, Gonthier et al. [82] focused on more specific objects or visual attributes that may be useful to art historians, such as ruins or nudity, and on iconographic characters, such as the Virgin Mary or Jesus. These categories are unlikely to be inherited directly from photographic databases. To overcome this problem, the authors proposed a "weakly supervised" approach that can learn to detect objects based only on image-level annotations. The goal is to detect new, unseen objects with minimal supervision.
A different perspective from which object detection can be viewed in this context is near duplicate detection. The goal of this task is not to find distinct instances of the same object class, but to automatically discover nearly identical patterns in different images. In [12], Shen et al. addressed this problem by applying a deep neural network model to a dataset of artworks attributed to Jan Brueghel, purposely annotated by the authors. The key technical insight of the method is to adapt standard deep features to this task by fine-tuning them on the specific art collection using self-supervised learning, with spatial consistency between adjacent feature matches used as the supervisory signal. The adapted features lead to more accurate, style-invariant matching and can be used with a standard discovery approach, based on geometric verification, to identify duplicate patterns in the dataset. The method is self-supervised, which means that the training labels are derived from the input data itself. Ufer et al. [83] recently extended this research by presenting a multi-style fusion approach that successfully reduces the domain gap and improves retrieval results in larger collections with a large number of distractors.
As can be realistically supposed, the approaches discussed in this section are more successful with artworks that are “photo-realistic” by nature, but can fail or show degraded performance when used on more abstract styles such as Cubism or Expressionism. Such abstract styles pose serious challenges since the depictions of objects and subjects may show strong individualities and are therefore difficult to represent through generalizable patterns.
Content generation
A central problem in the Artificial Intelligence community is generating art through machines: in fact, making a machine capable of showing creativity on a human level (not only in painting, but also in poetry, music, and so on) is widely recognized as an expression of intelligence. Traditional literature on computational creativity has developed systems for art generation based on the involvement of human artists in the generation process (see, for example, [84]). More recently, the advent of the Generative Adversarial Network paradigm has allowed researchers to develop systems that do not put humans in the loop but make use of previous human products in the learning process. This is consistent with the assumption that even human experts use prior experience and their knowledge of past art to develop their own creativity.
Elgammal et al. [45] proposed CAN (Creative Adversarial Network): a variant of the classic GAN architecture that aims to create art by maximizing deviation from established styles while minimizing deviation from the art distribution. In other words, the model "tries to generate art that is novel, but not too novel". Deviating from established styles is important, as a classic GAN would simply "emulate" the previous data distribution, showing limited creativity. The effectiveness of CAN was assessed by involving the judgment of human evaluators, who regularly confused generated art with human art. Examples of images generated by CAN are shown in Fig. 8. Of course, the machine does not have a semantic understanding of the subject, since, as mentioned above, its learning is based only on exposure to prior art.
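The two competing objectives can be sketched conceptually as follows; this is a simplified rendering of a CAN-style generator loss (be art, but be ambiguous about style), and the exact formulation in [45] differs in its details.

```python
import torch
import torch.nn.functional as F

def can_generator_loss(d_real_fake_logit, d_style_logits):
    """Conceptual sketch of a CAN-style generator objective.

    d_real_fake_logit: discriminator score that the generated image is 'art'
    d_style_logits:    discriminator prediction of the image's style class
    """
    # 1) Stay close to the distribution of art: fool the real/fake head
    real_target = torch.ones_like(d_real_fake_logit)
    art_loss = F.binary_cross_entropy_with_logits(d_real_fake_logit, real_target)

    # 2) Deviate from established styles: maximize style ambiguity by pushing
    #    the style prediction towards the uniform distribution over styles
    n_styles = d_style_logits.size(1)
    uniform = torch.full_like(d_style_logits, 1.0 / n_styles)
    ambiguity_loss = F.cross_entropy(d_style_logits, uniform)  # soft targets

    return art_loss + ambiguity_loss
```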
Similarly, Tan et al. [85, 86] proposed ArtGAN: a GAN variant in which the gradient information with respect to the label is propagated from the discriminator back to the generator to improve representation learning and image quality. Further architectural innovations are introduced in the generator. Qualitative results show that ArtGAN is capable of generating plausible-looking artworks based on style, artist and genre.
Lin et al. [87] observed that attributes such as artist, genre and period can be crucial as control conditions to guide the painting generation. To this end, they proposed a multi-attribute guided painting generation framework in which an asymmetrical cycle structure with control branches is used to increase controllability during content generation.
Generating content with GANs is a field that has received tremendous interest in recent times. The interested reader can refer to the inspiring paper on style-based generators [88] and to GAN applications closely related to art, such as anime [89] and fashion [90] design.
Other topics
There are a number of topics that have been studied less thoroughly by researchers but which deserve to be mentioned. They are briefly described below.
Artistic to photo-realistic translation
Tomei et al. [91, 92] observed that the poor performance usually provided by state-of-the-art deep learning models is due to the natural images they are pre-trained on: this results in a gap between the high-level convolutional features of the natural and artistic domains, whose visual appearance differs greatly. To reduce the shift between feature distributions of the two domains, without the need to re-train the deep models, the authors proposed Art2Real: instead of fine-tuning previous models on the new artistic domain, the method translates artworks directly into photo-realistic images. The model generates natural images by retrieving and learning details from real photos through a similarity matching strategy that leverages a weakly supervised understanding of the scene. Experimental results show that the proposed technique leads to increased realism and reduces the domain shift.
Another way to perform image-to-image translation, particularly in the opposite direction, is to use the well-known neural style transfer technique originally proposed by Gatys et al. [93]. This consists in combining the content of one image with the style of another, typically a painting, to synthesize a new, combined image. Unfortunately, while effective in transferring artistic styles, this method usually works poorly in the opposite direction, i.e. when asked to translate artworks into photo-realistic images. Since there are many studies exploring how to automatically transform photo-realistic images into synthetic artworks, and some literature reviews, such as [94], have already been published, the topic of neural style transfer is not covered in this paper.
Fake detection
An essential task for art experts is to judge the authenticity of an artwork. Historically, this task was based on the search for detailed "invariant" characteristics of an artist's style regardless of composition or subject matter. Currently, the analysis of these details is supported by techniques based, among others, on chemical and radiometric analyses. State-of-the-art computer vision techniques have the potential to provide cost-effective alternatives to the sophisticated analyses performed in laboratory settings. To this end, Elgammal et al. [95] proposed a computerized approach to analyzing strokes in artists' line drawings to facilitate the attribution of drawings without being fooled by counterfeit art. The proposed methodology is based on quantifying the characteristics of individual strokes by combining different hand-crafted and learned features. The authors experimented with a collection of drawings mainly by Pablo Picasso, Henri Matisse and Egon Schiele, showing good performance and robustness to falsely claimed works.
Representativity analysis
In a very recent work, Deng et al. [96] have proposed the concept of representativity to quantitatively assess the extent to which a given individual painting can represent the general characteristics of an artist’s creation. To tackle this task, the authors proposed a novel deep representation of artworks enhanced by style information obtained through a weighted pooling feature fusion module. Then, a graph-based learning method is proposed for representativity learning, which considers intra-category and extra-category information. Since historical factors are significant in the art domain, the time of creation of a painting is introduced into the learning process. User studies showed that the proposed approach helps to effectively access artists’ creation characteristics by ordering paintings in accordance with representativity from highest to lowest.
Emotion recognition and memorability estimation
With the rise of visual data online, understanding the feelings aroused in the observer by visual content is gaining more and more attention in research. However, there are few works in the literature that address this challenging task, mainly due to the inherent complexity of the problem and the scarcity of digitized artworks annotated with emotional labels.
Lu et al. [97] proposed an adaptive learning approach to understand the emotional appeal of paintings. The method uses labelled photographs and unlabelled paintings to distinguish positive from negative emotions and to differentiate reactive emotions from non-reactive ones. The learned knowledge of photographs is transferred to paintings by iteratively adapting feature distribution and maximizing the joint likelihood of labelled and unlabelled data.
Cetinic et al. [98, 99] investigated the possibility of using learned visual features to estimate the emotions evoked by art as well as painting memorability, i.e. how easy it is for a person to remember an image. In fact, people have been shown to share a tendency to remember the same images, indicating that memorability is universal in nature and lies beyond our subjective experience [100]. This also indicates that some image features contribute more to memorability than others. The authors used a model trained to predict memorability scores of natural images to explore the memorability of artworks belonging to different genres and styles. Experiments showed that nude and portrait paintings are the most memorable genres, while landscape and marine paintings are the least memorable. As for image style, it turned out that abstract art is more memorable than figurative art. Additionally, an analysis of the correlation between memorability and image features related to composition, colour and the visual sentiment response evoked by abstract images was provided. Results showed that there is no correlation between symmetry and memorability, whereas memorability is positively correlated with the likelihood of an image evoking a positive feeling. Their results also suggest that content and image lighting have a significant influence on aesthetics, that colour vividness and harmony strongly influence sentiment prediction, and that the emphasis on objects has a strong impact on memorability.
Visual question answering
Recently, Garcia et al. [54] built, on top of the previously proposed SemArt, the AQUA dataset, which aims to be a preliminary benchmark for visual question answering in the art domain. This refers to the task in which a computer vision system, given a text-based question about an artwork, must return an answer to the user. Baseline results were presented by the authors with a two-branch model in which visual and knowledge questions are handled independently.
Artwork captioning
Cetinic [55] recently noted that, while image captioning has been extensively studied in recent literature, little work has been done in the art domain. Image captioning refers to the task of recognizing objects and their relationships in an image in order to generate syntactically and semantically correct textual descriptions of it. She conducted an experiment with state-of-the-art methods, finding that it is possible to generate meaningful captions from art which show strong relevance to the historical-artistic context of the image.
Summary
A schematic overview of the studies reviewed in this paper is provided in Table 2. The studies are listed in chronological order to provide a final historical perspective on the research topic; for each of them, the main task, the method used and the results obtained are summarized.
Table 2 Summary of the studies reviewed. (Note that in the case of partially overlapping works by the same authors, only one article is reported here)