Art Authentication with Vision Transformers

In recent years, Transformers, initially developed for language, have been successfully applied to visual tasks. Vision Transformers have been shown to push the state-of-the-art in a wide range of tasks, including image classification, object detection, and semantic segmentation. While ample research has shown promising results in art attribution and art authentication tasks using Convolutional Neural Networks, this paper examines whether the superiority of Vision Transformers extends to art authentication, thereby improving the reliability of computer-based authentication of artworks. Using a carefully compiled dataset of authentic paintings by Vincent van Gogh and two contrast datasets, we compare the art authentication performances of Swin Transformers with those of EfficientNet. Using a standard contrast set containing imitations and proxies (works by painters with styles closely related to van Gogh), we find that EfficientNet achieves the best performance overall. With a contrast set that only consists of imitations, we find the Swin Transformer to be superior to EfficientNet by achieving an authentication accuracy of over 85%. These results lead us to conclude that Vision Transformers represent a strong and promising contender in art authentication, particularly in enhancing the computer-based ability to detect artistic imitations.

hoc feature extraction methods (including fractal analysis, wavelet coefficients, and edge detection) to represent visual artistic features such as brushstrokes, followed by a machine learning model trained on such features to distinguish the works of the artist from possibly similar works by other artists [4,5,6,7,8]. The excellent pattern-recognition abilities of Convolutional Neural Networks (CNNs) have led to a new wave of studies showing impressive performances on art-classification tasks [9,10,11] and many other visual tasks [12,13,14]. These studies involve complex CNN architectures that are trained on large digitized art collections, generally adding to the CNN a last dense (fully-connected) layer. The last layer feeds into a single output neuron in case of art authentication or into N output neurons for art attribution to one of N artists [15].
It should be acknowledged that computer-based art attribution and authentication are not without their limitations and challenges. The first group of limitations stems from the digital nature of the images used in this technique. These images might have deformations and loss of information because of factors such as image resolution, lighting conditions, camera type, and post-processing compression rate. The second group of limitations pertains to connoisseurship. Previous works [16,17] have discussed the role of the machine as a new type of art expert responsible for attributions and authentications. Bell and Offert (2021) [16] have highlighted important similarities between human and machine connoisseur approaches, such as knowledge of numerous works by the same artist and related works. However, there are noteworthy differences that constitute limitations of the computer-based techniques. While the computer relies solely on optical information (images), the human connoisseur also considers contextual information, including but not limited to historical knowledge, provenance, and scientific results. While most of the early studies have primarily focused on traditional machine learning for art-attribution tasks [18,19,5,6], our paper delves into the more specific task of art authentication, using Vincent van Gogh as a case study. Our paper aims to perform a comparative evaluation of Vision Transformers (ViTs) [20,21,22] and CNNs on the art-authentication task, and determine the level of performance that can be attained on this challenging task.

Selection of architectures
As we are interested in a comparison between previous state-of-the-art CNN-based methods and ViTs, we have to select representatives of both types of methods. To select a CNN architecture, we determine the best-performing architecture on art-classification tasks by relying on a sample of studies performed over the last 10 years. Although the selected studies have been performed with different methods and datasets, and mostly focused on art attribution instead of art authentication, their performances provide a clear sign of the best-performing architecture. Table 1 lists the performances and performance measures for five representative studies over the last 10 years. The performances fall within a limited range, 78–91%. The best-performing study of Table 1 [11] made use of the ResNet101 architecture [23]. Hence, we select ResNet101 as one of the CNN architectures for our experiments. As will be motivated in Section 3.3, we include another CNN architecture called EfficientNet [24] in our selection. For the ViTs we will rely on two variants of a state-of-the-art architecture called Swin Transformer [21].
The outline of the rest of the paper is as follows. Section 2 reviews CNNs and ViTs, highlighting the architectures used in this study: ResNet101, EfficientNet, and the Swin Transformer. Section 3 details the experimental procedure, and Section 4 presents the results. Section 5 ends the paper with a conclusion and discussion of future work.
stages learn the residual function F(x) = H(x) − x by using skip connections. The use of skip connections allows ResNets to excel due to their increased depth. EfficientNet represents a class of CNN models introduced by Tan and Le [24]. These models are optimised by scaling the width, depth, and input resolution of CNNs with a fixed ratio.
EfficientNets have demonstrated superior performance compared to ResNets on image classification tasks. In our experiments, we use the variants ResNet-101 and EfficientNetB5.
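The residual formulation F(x) = H(x) − x can be illustrated with a minimal sketch. The function names and the toy residual function below are hypothetical stand-ins chosen for illustration; they are not taken from any ResNet implementation.

```python
import math

def toy_residual(x):
    # Hypothetical stand-in for the stacked layers that learn the residual F.
    return [0.1 * math.tanh(v) for v in x]

def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x; the '+ x' term is the skip connection."""
    return [f + v for f, v in zip(residual_fn(x), x)]

x = [1.0, -2.0, 0.5]
h = residual_block(x, toy_residual)
```

By construction, subtracting the input from the block output recovers the residual, so the layers only need to model the deviation from the identity mapping.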
Vision transformers are relatively new deep learning architectures that have gained considerable attention and popularity in the computer vision community [20]. They represent a departure from traditional CNNs by replacing the typical convolutional layers with attention mechanisms [29]. In linguistic tasks, the introduction of an attention mechanism facilitated the encoding of long-range contextual information, which led to exceptional results on a wide range of tasks [30]. Recent breakthrough performances of GPT-4 [31] and related large language models are due to the power of Transformers. One of the main advantages of ViTs is their ability to capture relatively long-range dependencies within an image, which is essential for a wide range of computer vision tasks. This is achieved through the attention mechanism, which allows the model to attend to any region of the image when making predictions, rather than being limited to a fixed image context, like CNNs. ViTs have achieved state-of-the-art results on several image classification benchmarks, including ImageNet, and have shown promising results on other tasks, such as object detection and semantic segmentation.
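To make the notion of attending to any region concrete, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention over a sequence of image tokens. The token count, the dimensionality, and the random projection matrices are illustrative assumptions and do not correspond to any of the architectures evaluated here.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (n_tokens, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8                               # toy sizes (assumed)
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` spans all tokens, so every output token can draw on information from the entire image rather than from a local neighbourhood only.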
The Swin Transformer was recently proposed as a generic Transformer-based backbone for computer vision [21,22]. The basic architecture is hierarchical and employs an efficient self-attention mechanism using shifting windows. Its hierarchical architecture allows for capturing multi-scale relations and its shifting windows mitigate the growth of computational complexity with image size. Figure 1 illustrates a four-stage Swin Transformer, the so-called "Swin-Tiny" variant. The input comprises an image of size H × W × 3, which is partitioned into H/4 × W/4 non-overlapping patches of 4 × 4 pixels with 3 channels each (the rectangle labeled "Patch Partition"). Each patch is embedded into a "token" of dimensionality C by means of a linear layer ("Linear Embedding"), yielding a token map of size H/4 × W/4 × C, where C is an arbitrary dimensionality parameter of the Swin architecture. The token map is fed into the building block of the Swin Transformer ("SWIN Transformer Pair"), the inner structure of which is illustrated in Figure 2. The first block consists of layer normalisation, multi-head attention, layer normalisation, and a two-layer multilayer perceptron. The multi-head (self-)attention is applied within non-overlapping M × M windows of the token map (M = 7). The curved arrows represent skip connections. The second block is identical to the first, but applies attention to shifted M × M windows.
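The regular and shifted window layouts can be sketched in NumPy as follows. The sketch assumes a 56 × 56 token map (the H/4 × W/4 grid for a 224 × 224 input), a toy channel count of 2, and a cyclic shift of M // 2 positions; the function names are hypothetical, and the attention masking that a full implementation applies to the wrapped-around windows is omitted.

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) token map into non-overlapping (m, m, C) windows."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)

def shifted_window_partition(x, m):
    """Cyclically shift the map by m // 2 before partitioning (second block)."""
    shifted = np.roll(x, shift=(-(m // 2), -(m // 2)), axis=(0, 1))
    return window_partition(shifted, m)

# Toy token map: 56 x 56 tokens with 2 channels (channel count is assumed).
tokens = np.arange(56 * 56 * 2, dtype=np.float32).reshape(56, 56, 2)
regular = window_partition(tokens, m=7)          # 8 x 8 = 64 windows
shifted = shifted_window_partition(tokens, m=7)  # same count, offset by 3
```

Because attention is computed within each 7 × 7 window, the cost grows with the number of windows rather than with the square of the number of tokens, while the shift lets information cross the window borders of the preceding block.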

Experiments
This section specifies the experiments by discussing the van Gogh dataset (3.1), the data preparation and augmentation (3.2), the specific CNN and Swin architectures and their hyperparameter settings (3.3), and our evaluation procedure (3.4).

Van Gogh dataset
Our dataset for the authentication task was carefully collected and consists of 654 images of authentic paintings (authentic set) and 669 or 137 images of non-authentic ones (depending on the type of contrast set). The resolutions of the images of artworks vary from one reproduction to another. In what follows, we outline the authentic set and two versions of the contrast set: the "standard contrast set" and the "refined contrast set". As will be described in Section 4, the development of the refined contrast set is motivated by the results on the standard contrast set, which reveal that art authentication requires a more constrained selection of artworks in the contrast set. The composition of each contrast set is described below.

Authentic set
When compiling our authentic set, we have used the standard 'La Faille' Catalogue Raisonné [32] as a reference, meaning that all authentic images used for training are recorded there. Moreover, we have removed from the authentic set the images whose authenticity is questioned by contemporary experts. This approach enables us to mitigate the risk of accidentally introducing fake artworks into the original dataset (label noise). The careful crafting of the authentic set distinguishes this work from previous ones, which are usually trained on images downloaded from WikiArt [33] (a less reliable source as compared to the established Catalogue Raisonné).

Contrast set
As art authentication involves a binary classification task, we carefully compile a second set that serves as a contrast to the authentic works. This secondary set consists of negative examples, i.e., artworks that are not attributed to van Gogh.

Standard contrast set
The standard contrast set features 69 imitations: 10 copies by followers of van Gogh such as Vik Muniz, Blanche Derousse and Jamini Roy; 40 imitations in van Gogh's style; and 21 known forgeries, including 8 produced by the famous forger Wacker [34,35]. In addition, to achieve a balance with the authentic set, the standard contrast set also incorporates 600 proxies, which are paintings by contemporary artists who utilized techniques and styles similar to those of van Gogh, mainly Post-Impressionism, Cloisonnism, and Japonism. The main proxy artists are Paul Cézanne (114 images), Henri de Toulouse-Lautrec (48 images), Maurice Prendergast (47 images), and Henri Matisse (47 images).

Refined contrast set
Including proxies in the standard contrast set introduces painting styles that differ greatly from those of van Gogh. Hence, for the construction of our refined contrast set, we remove all proxies and gather additional imitations from auction archives. We include 68 additional images that were cataloged as being inspired by van Gogh: 50 images are described as "After Vincent van Gogh", 14 are "in Manner of Vincent van Gogh", 2 are "Attributed to Vincent van Gogh", 1 is "Circle Vincent van Gogh", and 1 is "Follower Vincent van Gogh". Table 2 shows the composition of the refined contrast set relative to the standard contrast set.

Data preparation
The dataset consists of sub-images of paintings, i.e., RGB images normalized to a fixed size of 256 × 256 pixels, with the channel values normalized to the unit interval. The sub-images are created by dividing the whole image into 2^p × 2^p equally sized units, with p depending on the resolution of the original image as follows: p = 2 if the smaller side of an image is larger than 1024 pixels, and p = 1 if the smaller side is larger than 512 pixels and smaller than 1024 pixels. For all images, regardless of the resolution, we also include the sub-image of the center-cropped square stemming from the full image. Figure 3 exemplifies the generation of 16 square, center-cropped patches from an authentic van Gogh painting. This patching method allows the models to extract very fine-grained brushstroke-level information from the smaller patches, but also more compositional and representational features from the full patch and the larger patches. Some of the examined architectures require an input size of 224 × 224 pixels. In that case, the original 256 × 256 sub-images were downsampled using bicubic resampling. To emphasize the importance of imitations over proxies in the standard contrast set, we assign sample weights w_im to the imitations. In preliminary experiments, we found that w_im = 10, meaning that imitations weigh ten times more than proxies, yields the best results. This value remains fixed across all experiments with the standard contrast set described in this study. In the refined contrast set, we did not employ sample weighting, i.e., w_im = 1.
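The patching scheme can be sketched as a function that returns crop boxes for one painting. This is a minimal sketch under stated assumptions: the function names are hypothetical, images whose smaller side does not exceed 512 pixels are assumed to contribute only the full center crop, a smaller side of exactly 1024 pixels is assumed to fall under p = 1, and each grid unit is assumed to be center-cropped to a square as suggested by Figure 3.

```python
def center_square(left, top, width, height):
    """Return the (left, top, right, bottom) box of the centered square."""
    side = min(width, height)
    x0 = left + (width - side) // 2
    y0 = top + (height - side) // 2
    return (x0, y0, x0 + side, y0 + side)

def grid_exponent(width, height):
    """Choose p from the smaller image side (boundary handling is assumed)."""
    smaller = min(width, height)
    if smaller > 1024:
        return 2
    if smaller > 512:
        return 1
    return None  # assumption: low-resolution images get only the full crop

def sub_image_boxes(width, height):
    """Crop boxes for all sub-images of one painting, full crop first."""
    boxes = [center_square(0, 0, width, height)]
    p = grid_exponent(width, height)
    if p is not None:
        n = 2 ** p  # the grid has 2^p x 2^p units
        cell_w, cell_h = width // n, height // n
        for row in range(n):
            for col in range(n):
                boxes.append(
                    center_square(col * cell_w, row * cell_h, cell_w, cell_h)
                )
    return boxes

boxes = sub_image_boxes(2000, 1600)  # high resolution: 1 + 16 sub-images
```

With an imaging library such as Pillow, each box could then be passed to `Image.crop` and the result resized to 256 × 256 pixels.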
We evaluate each model in N = 20 experiments. In each experiment, we randomly assign the paintings, including their constituent patches, to the training, validation, and test partitions. These random assignments result in N training, validation, and test partitions. Each model is trained and evaluated on exactly the same N partitions. This ensures that each architecture is trained and evaluated in the same manner, which enables a fair comparative evaluation. Table 3 lists the compositions of the partitions in terms of the number of images for the authentic and contrast sets. In each experiment, a randomly selected subset of authentic images of approximately the same size as the set of contrast images is used for training. Because we subdivide each image into patches, the actual number of patches in each partition is much larger. For instance, for the experiments with the standard contrast set, the actual numbers vary slightly (because images differ in their number of patches): about fifteen thousand patches in the training set and two thousand patches in the validation and test partition each. We emphasize that all patches of each painting are always assigned to the same partition. As a consequence, the test set always consists of patches that were not part of the training or validation partitions.
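The painting-level assignment can be sketched as follows. The split fractions (0.8, 0.1, 0.1), the seeds, and the function names are illustrative assumptions; the actual partition sizes are those listed in Table 3.

```python
import random

def split_paintings(painting_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle painting ids and cut them into train/validation/test lists."""
    ids = sorted(painting_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

def assign_patches(patches, split):
    """Route each (painting_id, patch) pair to the partition of its painting."""
    lookup = {pid: name for name, ids in split.items() for pid in ids}
    out = {name: [] for name in split}
    for painting_id, patch in patches:
        out[lookup[painting_id]].append((painting_id, patch))
    return out

# Toy data: 50 paintings with 5 patches each (sizes are assumed).
patches = [(pid, k) for pid in range(50) for k in range(5)]
splits = [split_paintings(range(50), seed=s) for s in range(20)]  # N = 20
parts = assign_patches(patches, splits[0])
```

Splitting on painting identifiers rather than on patches is what guarantees that no painting contributes patches to more than one partition; reusing the same 20 seeded splits for every architecture keeps the comparison fair.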

Architectures and training procedure
The recent outstanding art-classification results reported by Dobbs and Ras [11], as discussed in Section 1.2, have led us to choose ResNet101, the 101-layer version of ResNet [23], as a representative CNN for our van Gogh authentication task. Although ResNet101 represents the state-of-the-art in art classification, it may not be the most robust CNN available. Therefore, to provide a more comprehensive evaluation, we include another CNN in our analysis that better represents the class of modern CNNs: EfficientNet [24]. Specifically, we select EfficientNetB5, as its complexity (measured by the number of parameters) roughly matches that of the simplest Swin Transformer. For our experiments, we utilize two variants of the Swin Transformer (Swin-Tiny and Swin-Base), with the detailed description of the Swin Transformer architecture provided in Section 2.
Using the standard contrast set, the four architectures examined are: EfficientNetB5, ResNet101, Swin-Tiny, and a larger version called Swin-Base. The latter is included to determine the potential beneficial effect of this larger Swin Transformer variant. EfficientNetB5 has 28M parameters, ResNet101 has 44.7M parameters, and Swin-Tiny and Swin-Base have 28M and 88M parameters, respectively. All architectures are pretrained on ImageNet [25]. ResNet101, EfficientNetB5, and Swin-Tiny are pretrained on the 1K version of ImageNet, whereas Swin-Base is pretrained on the 22K version of ImageNet.
In preliminary experiments we explored three variants of transfer learning: (i) freezing the base architecture and training a new top layer (the standard method of transfer learning), (ii) initially freezing the base, training the new top layer, and subsequently training the base and top with a small learning rate, and (iii) unfreezing all layers and training the entire architecture with a small learning rate. It turned out that variant (iii) gave the best results for all architectures, which is in line with previous findings in art classification [36,15]. Hence, in contrast to what is typical in transfer learning, we employed variant (iii), where the top was defined as a randomly-initialized dense layer. For the initialization, we use "He normal" initialization [37], which ensures that the random weight values do not saturate the receiving neurons' activations. To this end, the w_n weights feeding into a neuron are drawn from a (truncated) normal distribution with µ = 0 and σ = √(2/w_n). For all architectures, training is performed with binary cross-entropy as loss function, the Adam optimizer, batch size 32, learning rate 0.0001, early stopping (patience = 20 epochs and minimum delta = 0.001), and imitation-sample weights w_im = 10.
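Two ingredients of this training setup can be sketched in NumPy: drawing He-normal weights with standard deviation sqrt(2 / fan_in), and a binary cross-entropy in which imitations carry a sample weight of 10. The truncation at two standard deviations, the normalisation of the loss by the total weight, the label convention (1 = authentic), and all function names are assumptions made for illustration.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng, truncate_at=2.0):
    """Draw weights from a truncated normal with sigma = sqrt(2 / fan_in)."""
    sigma = np.sqrt(2.0 / fan_in)
    w = rng.normal(0.0, sigma, size=(fan_in, fan_out))
    # Resample values beyond truncate_at * sigma (assumed truncation rule).
    mask = np.abs(w) > truncate_at * sigma
    while mask.any():
        w[mask] = rng.normal(0.0, sigma, size=int(mask.sum()))
        mask = np.abs(w) > truncate_at * sigma
    return w

def weighted_bce(y_true, y_prob, weights, eps=1e-7):
    """Binary cross-entropy averaged with per-sample weights."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.sum(weights * losses) / np.sum(weights))

rng = np.random.default_rng(0)
top_weights = he_normal(fan_in=512, fan_out=1, rng=rng)  # toy dense top layer

# Toy batch: one authentic patch, one proxy patch, one imitation patch.
y_true = np.array([1.0, 0.0, 0.0])
y_prob = np.array([0.9, 0.2, 0.6])
is_imitation = np.array([False, False, True])
sample_weights = np.where(is_imitation, 10.0, 1.0)       # w_im = 10

plain = weighted_bce(y_true, y_prob, np.ones(3))
weighted = weighted_bce(y_true, y_prob, sample_weights)
```

In the toy batch the misclassified imitation dominates the weighted loss, which is the intended effect of the sample weights: errors on imitations are penalised more heavily than errors on proxies.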
For the experiments with the refined contrast set, we apply the same training procedure but do not use imitation-sample weights and restrict ourselves to EfficientNetB5 and Swin-Tiny. The motivation for focusing on these two architectures is twofold: (i) both architectures perform best in the experiments with the standard contrast set, and (ii) comparing the performances of these architectures is fair due to their almost equal parameter complexity.

Evaluation procedure
For each architecture, we performed N = 20 experiments, and report the average prediction accuracies for individual patches and for the entire paintings. The latter is determined for each artwork by taking the mean of the predictions of its constituent patches, including the sub-image with a center-cropped square stemming from the full image. To further understand the model's performance, we present accuracies per class, distinguishing between the authentic and the contrast classes. Additionally, within the contrast set, we provide separate accuracies for proxies and imitations.
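The painting-level evaluation can be sketched as follows. The decision threshold of 0.5, the label convention (1 = authentic, 0 = contrast), the function names, and the toy values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def painting_predictions(patch_records, threshold=0.5):
    """Average patch probabilities per painting and threshold the mean."""
    by_painting = defaultdict(list)
    for painting_id, prob in patch_records:
        by_painting[painting_id].append(prob)
    return {
        pid: int(mean(probs) >= threshold)
        for pid, probs in by_painting.items()
    }

def accuracy_per_group(predictions, labels, groups):
    """Accuracy over paintings, reported separately for each group name."""
    hits = defaultdict(list)
    for pid, pred in predictions.items():
        hits[groups[pid]].append(pred == labels[pid])
    return {group: sum(v) / len(v) for group, v in hits.items()}

# Toy patch outputs for three paintings (values are made up).
records = [("a", 0.9), ("a", 0.7), ("a", 0.4),
           ("b", 0.2), ("b", 0.6),
           ("c", 0.8), ("c", 0.7)]
labels = {"a": 1, "b": 0, "c": 0}
groups = {"a": "authentic", "b": "imitation", "c": "proxy"}

preds = painting_predictions(records)
per_group = accuracy_per_group(preds, labels, groups)
```

Averaging before thresholding lets confident patches outweigh borderline ones, so a painting can be classified correctly even when some of its patches are not.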

Results
In this section, we present separately the results for the experiments conducted with both the standard contrast set and the refined contrast set.

Results for the standard contrast set
Table 4 reports the results obtained with the standard contrast set. For each of the examined architectures (with the pretraining variants mentioned in Section 3.3), it lists the mean accuracy for the patches and the entire paintings, as well as the number of parameters for each architecture. From these results, we draw three observations. The main observation is that EfficientNetB5 yields the best art-authentication performance, both on patches and on entire images. We reiterate that in terms of the number of parameters, EfficientNetB5 has roughly the same complexity and initialization as Swin-Tiny (i.e., 28M and ImageNet 1K, respectively), which makes it a fair comparison. The second observation is that both Swin architectures yield a considerable improvement in performance over ResNet101, with accuracies of ≈ 0.89–0.90 roughly matching the performances listed in Table 1 in Section 1.2. The third observation is that although the Swin-Base Transformer performs marginally better than the Swin-Tiny Transformer on patches, it does not result in a better performance on paintings. At first sight, these results suggest that EfficientNetB5 outperforms ResNet101 and the Swin Transformers on art-authentication tasks. However, a closer examination of the results for the constituents of the standard contrast set leads to a different view.
Table 5 lists the accuracies for the authentic and standard contrast sets, as well as the two constituent types of contrast artworks: imitations and proxies. The results show that the performances obtained by all architectures mainly reflect a successful separation of authentic paintings and artworks by proxies, given that both have accuracies of more than 90%. On the other hand, the performance on the imitations is considerably lower, despite the use of sample weights. We acknowledge that the task of distinguishing imitations from originals is much more complex and fine-grained than distinguishing proxies from originals. Proxies are artworks created by known artists in their own style, albeit similar to the style of van Gogh, while the imitations (including copies and forgeries) contain only artworks that were created, explicitly or implicitly, in the style of van Gogh, with a clear and close emulation of the artist. Thus, this last category contains artworks with a much higher degree of similarity to the authentic ones.
Clearly, art authentication requires a fine distinction between imitations and authentic art. Hence, the poor performance on the imitations motivated the development of the refined contrast set. The results of our experiments with the refined contrast set are the subject of the next section.

Results for the refined contrast set
As mentioned in Section 3.3, we trained the two comparable architectures on the van Gogh dataset by using a refined contrast set which only comprises imitations. We did not use sample weights for these experiments. The obtained results are presented in Tables 6 and 7. Table 6 shows the accuracies for paintings in the authentic and refined contrast sets. We observe that in this case, a much better balance is achieved between the performances on the authentic and contrast artworks. This applies especially to Swin-Tiny, which outperforms EfficientNetB5 and achieves the best overall performance. The much-improved performance on the imitations also suggests that the second dataset is an improvement over the first, as it best addresses the core of art authentication: the separation between authentic works and reproductions. Along the same lines, the superior performance of Swin-Tiny on this second dataset suggests a non-negligible improvement over state-of-the-art CNNs. The confusion table (Table 8) shows the percentages of patches predicted correctly and incorrectly by both architectures. While they agree on the majority of correctly-classified patches (79%), their disagreement is limited to smaller percentages (7.4% and 5.7%).
Both architectures incorrectly classify a slightly larger percentage (8.2%) of patches. The differences in correctly predicted artworks suggest that avenues combining the strengths of both models may yield even better performance. In this sense, art authentication may benefit a little from a hybrid CNN-ViT approach that combines the strengths of both architectures.
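An agreement table of this kind can be computed from the patch predictions of two models as sketched below. The function name and the toy labels and predictions are illustrative assumptions and do not reproduce the reported percentages.

```python
def agreement_table(y_true, pred_a, pred_b):
    """Percentages of patches by whether each model's prediction is correct.

    Keys are (a_correct, b_correct) pairs of booleans.
    """
    counts = {(True, True): 0, (True, False): 0,
              (False, True): 0, (False, False): 0}
    for truth, a, b in zip(y_true, pred_a, pred_b):
        counts[(a == truth, b == truth)] += 1
    total = len(y_true)
    return {key: 100.0 * n / total for key, n in counts.items()}

# Toy example with ten patches (1 = authentic, 0 = contrast; assumed).
y_true    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_swin = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
pred_enb5 = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

table = agreement_table(y_true, pred_swin, pred_enb5)
```

The two off-diagonal entries, where exactly one model is correct, indicate how much a combination of both models could gain at most over either model alone.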
Figure 4 illustrates the differences between both architectures in terms of the distributions of their patch predictions. The histograms for EfficientNetB5 and Swin-Tiny are shown in the left and right columns, respectively. The top row displays the incorrect patch predictions, and the bottom row the correct ones. The top left histogram shows a relatively large number of wrong predictions in the interval 0.5-0.7 for EfficientNetB5, i.e., the first peak to the right of the midpoint. These indicate false positives, revealing a bias toward classifying patches as authentic. Such a peak is not evident for Swin-Tiny, although there are more "confident" false predictions at 0 and 1 (see the top right histogram).
Comparing the bottom two histograms, which show the correct predictions, it is clear that Swin-Tiny (right histogram) has a much larger number of very confident predictions (near 0 and 1) than EfficientNetB5. These illustrations reveal the subtle ways in which both types of architectures (CNN and Vision Transformer) differ in the realisation of their predictions. To what extent these differences are algorithm-specific is unclear and subject to further investigation.
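Histograms of this kind can be derived from patch outputs by binning the prediction values separately for correct and incorrect predictions. The sketch below assumes ten equal-width bins, a decision threshold of 0.5, and the label convention 1 = authentic; the function name and the toy values are hypothetical.

```python
def prediction_histograms(y_true, y_prob, n_bins=10, threshold=0.5):
    """Bin prediction values separately for correct and incorrect patches."""
    correct = [0] * n_bins
    incorrect = [0] * n_bins
    for truth, prob in zip(y_true, y_prob):
        bin_index = min(int(prob * n_bins), n_bins - 1)
        predicted = int(prob >= threshold)
        target = correct if predicted == truth else incorrect
        target[bin_index] += 1
    return correct, incorrect

# Toy patch outputs (values are made up).
y_true = [1, 1, 0, 0, 0, 1]
y_prob = [0.95, 0.55, 0.65, 0.05, 0.62, 0.35]

correct_hist, incorrect_hist = prediction_histograms(y_true, y_prob)
```

In the toy data, the two contrast patches scored at 0.62 and 0.65 fall into the incorrect histogram just above the threshold, which is the pattern of low-confidence false positives described for EfficientNetB5.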

Conclusion and future work
We performed a comparative evaluation of CNNs and Vision Transformers. We found that EfficientNetB5 outperforms the Swin-Tiny and Swin-Base Transformers on the standard contrast set, by favoring the classification of proxies over that of imitations. In our example, this shows that EfficientNetB5 is better able to distinguish between van Gogh and his contemporaries than both Swin Transformers. The Swin-Tiny Transformer was shown to be marginally superior to EfficientNetB5 on a refined contrast set (containing imitations only) that better reflects the essence of art authentication. For the Swin-Tiny Transformer, the change in contrast set was associated with a jump in imitation-classification accuracy from 0.53 on the standard contrast set to 0.84 on the refined contrast set.
While further tests should be carried out to determine the generalizability of these results to other artists' datasets, we also highlight that the deep learning approach to art authentication is inherently more generalizable than the feature engineering approaches mentioned in Section 1.1, as it requires little hyperparameter tuning and does not rely on an isolated feature (e.g., brushstrokes) which may not be visible in the work of all artists.

Figure 2: Illustration of the inner structure of the Swin Transformer Pair. Based on [21].

Figure 3: Left side: original image. Right side: 4 × 4 grid and center-cropped square patches highlighted as bright regions. The image shows Oliviers avec ciel jaune et soleil. Collection: Minneapolis Institute of Art, Location: Minneapolis Institute of Art, oil on canvas. Availability: public domain, CR number: F710.

Figure 4: Histograms of the patch predictions for EfficientNetB5 (ENB5, left) and Swin-Tiny (SwinT, right). The top row shows the distribution of incorrect prediction values, the bottom row that of the correct ones. Note that the top and bottom rows have different vertical scales.

Table 1: Overview of performances in art classification over the last 10 years.

Table 2: Composition of the standard and refined contrast sets. The proxies for the standard contrast set consist of artists that painted in the same styles as van Gogh, whereas for the refined contrast set the imitations are expanded and include imitations from auction records.

Table 3: Number of images in each of the three partitions for the experiments with the standard and refined contrast sets.

Table 4: Overview of the results obtained with the standard contrast set. For each architecture, the accuracies averaged over N = 20 experiments and standard deviations are listed for the patches and entire images.

Table 5: Painting-based test accuracies for the authentic and standard contrast sets, and the two types of contrast artworks: imitations and proxies.

Table 6: Painting-based test accuracies for the authentic and refined contrast sets, using the EfficientNetB5 and Swin-Tiny architectures.

Table 7 lists the mean accuracy, precision, and recall for EfficientNetB5 and Swin-Tiny. The latter scores best on all three metrics, showing that Swin-Tiny exhibits the best painting-based authentication performance.

Table 7: Painting-based accuracies, precision, and recall per type of artwork, with the refined contrast set. Pre-training and number of parameters as indicated in Table 4. See text for details.

Table 8 provides insight into the degree of overlap of patch predictions made by both architectures.

Table 8: Confusion table showing the percentages of correct and incorrect patch predictions for Swin-Tiny and EfficientNetB5 on the refined contrast set.