1 Introduction

Writer retrieval is of core interest to historians, librarians, paleographers [2], law enforcement [3], and fraud prevention [4]. Due to the scale of finding individual authors in a sea of reference documents and the difficulty of deciding whether two documents were written by the same person, automated mechanisms have to be employed to solve writer retrieval at any meaningful scale. Contemporary writer retrieval methodologies revolve around the aggregation of independent patch embeddings derived from documents; however, the prevailing aggregation mechanisms tend to be relatively straightforward, often relying on basic techniques such as mean- or max-pooling [5]. To advance writer retrieval capabilities, this study investigates the potential of learned feature aggregators.

The primary aim of this research is to examine the efficacy of complex architectures in comparison to conventional aggregation approaches. To this end, we undertake a comprehensive analysis of two distinct architectures: the NetVLAD [6] architecture, which is commonly used to merge different views of the same data point, e. g., the same place under different weather conditions [6], and the Transformer architecture [7], whose attention mechanism achieves state-of-the-art performance in several different domains.

Both NetVLAD and the Transformer architecture undergo rigorous training across an array of objectives spanning self-supervised, supervised, and metric learning. This multifaceted approach ensures a comprehensive evaluation of their capabilities. Despite the extensive training and the strong performance witnessed in diverse applications, our findings across both architectures reveal a persistent challenge. The achieved performance consistently falls short of the prevailing state-of-the-art benchmarks, thereby suggesting an intrinsic limitation associated with naive learned feature aggregators.

The lack of performance is particularly extreme in the NetVLAD case: While the transformer only lags behind the state-of-the-art by \(\approx 3.0\%\) top-1 accuracy, NetVLAD drops by nearly 20 percentage points. The relative simplicity of current feature aggregators, together with the fact that both the transformer and the NetVLAD method could fall back to simple mean-pooling, suggests inherent issues with the degrees of freedom afforded by training the feature aggregation layer alongside the feature extractor.

By shedding light on the limitations of learned feature aggregators in patch-based writer retrieval, our research contributes to the existing body of knowledge in the field of writer retrieval.

In the subsequent sections, we present a detailed analysis of our experiments, including the different model configurations, augmentations, and learning objectives that were explored. We evaluate the quality of writer retrieval achieved by the learned aggregators in each case and discuss the implications of our findings. Through this comprehensive investigation, we aim to provide a deeper understanding of the challenges and potential strategies for leveraging learned models such as transformers as feature aggregators in the context of patch-based writer retrieval.

2 Background

2.1 The transformer architecture

Transformers have emerged as the leading methods in natural language processing (NLP), computer vision, and multi-modal applications thanks to their remarkable ability to capture complex relationships and dependencies in data. Specifically relevant to writer retrieval, transformers also allow for inputs of different lengths, which is essential for any general feature aggregator.

These models operate through the utilization of two fundamental types of layers: multi-headed self-attention and dense feedforward blocks. Self-attention is a pivotal component of transformers, and it functions by establishing connections between all elements within a given set of inputs, referred to as X. This connection-building process is accomplished by comparing and relating each token to every other token in the set. The information flow is then determined based on the similarity of these tokens. The calculation is executed using the following equation

$$\begin{aligned} {\text {attention}}(Q,K,V) = {\text {softmax}}\left( \frac{QK^T}{\sqrt{d_k}}\right) V. \end{aligned}$$
(1)

In this equation, the Q, K, and V matrices represent linear projections of the original input set X, and \(d_k\) corresponds to the dimension of the Q, K, and V output matrices [7]. The softmax function is employed to assign relative weights to the connections, which are then used to compute the weighted sum of the values in the V matrix. One possible interpretation of this is a routing over a fully connected graph that aggregates information based on the weight \(q_i^T k_j\) between the nodes i and j. This weighted-aggregation view suggests that self-attention is an interesting candidate for aggregating other types of information, such as the information present in the different image patches which make up the features of modern writer retrieval processes.
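
For concreteness, Eq. (1) can be sketched in a few lines of PyTorch; the framework, shapes, and names here are purely illustrative and not tied to our training code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Single-head attention as in Eq. (1); q, k, v have shape (n_tokens, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarities (n_tokens, n_tokens)
    weights = F.softmax(scores, dim=-1)             # each row sums to one
    return weights @ v                              # weighted sum over the value vectors

# Example: 200 patch tokens with 64-dimensional embeddings
x = torch.randn(200, 64)
w_q, w_k, w_v = (torch.nn.Linear(64, 64) for _ in range(3))  # linear projections of X
out = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))   # (200, 64)
```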

Additionally, transformers employ dense feedforward blocks, which consist of two feedforward layers that operate on individual tokens. This architectural design aims to extract further information from the aggregated features obtained through self-attention. Specifically, feedforward layers have the form

$$\begin{aligned} {\text {MLP}}(x) = {\text {Feedforward}}\left( \sigma \left( {\text {Feedforward}}(x)\right) \right) \end{aligned}$$

for each individual token \(x\in X\) independently. \(\sigma \) is an arbitrary activation function such as leaky ReLU [8] or SwiGLU [9]. In most works, the interior dimension between the two feedforward layers is much higher (two to eight times) than the exterior dimension, forming an inverted bottleneck. The transformer architecture consists of the alternating application of a self-attention layer followed by the two-layer MLP described above, with LayerNorms [10] before every block. For more details, we refer to [7].
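
Putting the two layer types together, a single pre-norm transformer block can be sketched as follows; the head count, expansion factor, and GELU activation are common defaults rather than our exact configuration.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm block: LayerNorm -> self-attention -> LayerNorm -> two-layer MLP."""
    def __init__(self, dim=64, heads=4, expansion=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                 # inverted-bottleneck MLP
            nn.Linear(dim, expansion * dim),
            nn.GELU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):                         # x: (batch, n_tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

y = TransformerBlock()(torch.randn(2, 200, 64))   # two documents, 200 patch tokens each
```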

2.2 NetVLAD

NetVLAD [6] is a soft-assignment version of the original VLAD [11] that is amenable to end-to-end training. Fundamentally, the idea of both NetVLAD and VLAD is to cluster N embeddings \(e_1,\dots ,e_N\in \mathbb {R}^D\) into K clusters with centers \(\mu _1,\dots ,\mu _K\in \mathbb {R}^D\) and to accumulate, for each cluster, the residuals of the embeddings assigned to it.

$$\begin{aligned} {\text {VLAD}}(e, \mu ) = \left( \sum _{i=1}^N \mathbbm {1}_{e_i\approx \mu _1} (e_i - \mu _1),\, \dots ,\, \sum _{i=1}^N \mathbbm {1}_{e_i\approx \mu _K} (e_i - \mu _K)\right) . \end{aligned}$$

This process produces a \(K\times D\) matrix, which, notably, is independent of the number of descriptors initially put into the model. For writer retrieval specifically, this can be seen as generating a global view by marginalizing all local views \(e_i\).

The idea of NetVLAD is to replace the hard cluster assignments \(\mathbbm {1}_{e_i\approx \mu _k}\) with soft cluster assignments from a softmax

$$\begin{aligned} {\text {NetVLAD}}(e, \mu )&= \left( \sum _{i=1}^N \alpha _1(e_i) (e_i - \mu _1),\, \dots ,\, \sum _{i=1}^N \alpha _K(e_i) (e_i - \mu _K)\right) ,\\ \text {where}\quad \alpha _j(e_i)&= \frac{\exp (-\alpha \Vert e_i-\mu _j\Vert _2^2)}{\sum _{k}\exp (-\alpha \Vert e_i-\mu _{k}\Vert _2^2)}. \end{aligned}$$

NetVLAD additionally parameterizes the soft assignment with a linear model \(w_k = 2\alpha \mu _k\) and \(b_k = -\alpha \Vert \mu _k\Vert ^2\), which yields

$$\begin{aligned} \alpha _j(e_i) = \frac{\exp (w_j^Te_i+b_j)}{\sum _{k}\exp (w_k^Te_i+b_k)}, \end{aligned}$$

allowing for efficient training. One can interpret this as replacing the original vectors with the mean position relative to the cluster centers.
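
A compact sketch of such a NetVLAD layer in PyTorch, using the linear parameterization above; the cluster count, dimensions, and the final L2 normalization are illustrative choices rather than a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assignment VLAD with the linear parameterization w_k, b_k."""
    def __init__(self, num_clusters=100, dim=64):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))  # mu_1..mu_K
        self.assign = nn.Linear(dim, num_clusters)                     # computes w_k^T e_i + b_k

    def forward(self, e):                                # e: (N, dim) patch embeddings
        alpha = F.softmax(self.assign(e), dim=-1)        # (N, K) soft assignments
        residuals = e.unsqueeze(1) - self.centroids      # (N, K, dim) residuals e_i - mu_k
        vlad = (alpha.unsqueeze(-1) * residuals).sum(0)  # (K, dim) weighted residual sums
        return F.normalize(vlad.flatten(), dim=0)        # common final L2 normalization

doc_embedding = NetVLAD()(torch.randn(500, 64))          # 500 patches -> one global vector
```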

It is worth noting that NetVLAD can be seen as a restricted version of attention, where one replaces the squared \(L_2\) distance \(\exp (-\alpha \Vert x-y\Vert _2^2)\) with an inner product \(\exp (\langle x,y\rangle )\), turns \(Q:\mathbb {R}^n\rightarrow \mathbb {R}\) into a fixed, but trainable, parameter, and views V as the residuals between \(\mu \) and e. This connection arises because both attention and NetVLAD turn the hard assignments of an argmax into soft assignments using a softmax.

2.3 Writer retrieval

In the context of patch-based writer retrieval, current frameworks typically involve the extraction of individual patches from a document, e. g., at SIFT keypoint locations. These patches are then processed separately by some unsupervised computer vision model, such as a ResNet [12], and external mechanisms, such as Vector of Locally Aggregated Descriptors (VLAD) encodings, are employed to merge the individual patch descriptors into a combined embedding. Recent works mostly focus on improving the feature extraction stage (see, for example, [1, 5]), but not the information aggregation phase. Classical methods rely on hand-crafted features [13,14,15,16,17,18,19,20,21] such as SIFT descriptors [15, 16, 18, 21]; deep learning-based solutions, such as [22, 23], tend to work in a very similar way, simply replacing the hand-crafted features with learned extractors [17, 18].

Therefore, in this study, we investigate the potential of attention-based approaches as feature aggregators in patch-based writer retrieval. This would allow one to train the actual retrieval pipeline end-to-end, potentially leading to higher performance than is currently possible with two-stage patch-based models.

2.4 Limitations of current writer retrieval methods

The core limitation of patch-based writer retrieval is the forced usage of two-level optimization schemes. This means that the feature descriptors extracted for each independent patch cannot be optimized specifically for retrieval, as would be the case, for example, in document retrieval used in NLP (see [24]). The result is that patch-based retrieval schemes have to first optimize the feature encoders with a surrogate objective that is believed to be good for writer retrieval, before an independent patch aggregator can produce the actual retrieval targets. This means that writer retrieval lags behind the performance observed in, for example, document retrieval, because model designs have to be rather conservative when it comes to feature extraction and aggregation. In contrast to the two-stage approach, a learned aggregator can take advantage of the specific patch-level structure by backpropagating retrieval errors through the entire network to the feature extractors, thereby optimizing the representation toward the information necessary for proper retrieval. This means that one can optimize for final retrieval performance by training on a relaxed version of nearest neighbor clustering (see Sect. 3). This could promise significant performance and robustness increases, as practitioners are no longer reliant on specific feature extraction and aggregation schemes (Fig. 1).

Fig. 1 Writer identification pipeline. Starting from the original image, we first run a patch extraction step as in, e. g., [18], after which the patches are processed independently via a ResNet-18 [12] model; finally, we aggregate all patches via either a Transformer or NetVLAD model into a single global prediction. In contrast to prior art, we train this entire method end-to-end. The focus of this work is on the part in green (aggregation and loss function)

3 Methodology

In this section, we describe the methodology employed in our study to investigate the potential of attention-based models as feature aggregators in patch-based writer retrieval. One crucial aspect to consider when jointly training feature extraction and aggregation is disentangling the impact of individual components (such as architecture, loss functions, and augmentations) from the overarching goal of improving the overall model. This is rather challenging, as contemporary methods cannot use any end-to-end document-level losses and models, and our document-level end-to-end models cannot solely rely on patch-level losses, meaning any combination we consider is going to be novel compared to existing patch-based systems. Specifically, models based on patch-level descriptors are missing the architectural and loss components necessary to weight and combine different patches. For this reason, we benchmark different combinations of models, losses, and augmentations to “integrate out” the impact of any individual choice, i. e., assuming a performance measure \(p(\text {performance}| \text {architecture}, \text {loss}, \text {augmentation})\), we sample a large number of different combinations to get an estimate of the quality of retrievals, independent of any particular choice. For instance, if architecture A is worse than architecture B for all possible losses, then it stands to reason that architecture A is generally ill-suited for patch aggregation. Obviously, this cannot be exhaustive, but we can produce a first estimate of the capability of end-to-end trained writer retrieval models. We present the various components and objectives considered in our experiments, as well as the challenges encountered.

To begin, we adopt a standard ResNet-18 [12] feature extractor, which operates on \(32\times 32\) image patches. These patches are extracted at SIFT keypoint locations [25]. Unlike conventional approaches that rely on naive merging techniques, we propose end-to-end trained aggregators to combine the information from the individual patches effectively.

Regarding the learning objectives, we analyze four different variants. First, we explore supervised learning using cross-entropy, where the objective is to predict the writer IDs. This provides a natural “first glance” at the expected performance. We found that “center loss” [26], which is essentially a classification loss w.r.t. \(L_2\) distance and without bias, helps cross-entropy significantly, meaning our cross-entropy results also feature a center loss as a regularizer.
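
As an illustration, the combination of cross-entropy and center loss can be sketched as follows; the embedding dimension, the 0.1 weighting factor, and all names are illustrative rather than our exact training configuration.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Pulls each document embedding toward a learnable per-writer center (L2, no bias)."""
    def __init__(self, num_writers, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_writers, dim))

    def forward(self, z, writer_ids):                # z: (B, dim), writer_ids: (B,)
        return ((z - self.centers[writer_ids]) ** 2).sum(dim=1).mean()

# Joint objective: cross-entropy on writer IDs plus center loss as a regularizer.
classifier = nn.Linear(64, 394)                      # 394 training writers (ICDAR17)
center_loss = CenterLoss(394, 64)
z = torch.randn(8, 64, requires_grad=True)           # document embeddings from the aggregator
ids = torch.randint(0, 394, (8,))
loss = nn.functional.cross_entropy(classifier(z), ids) + 0.1 * center_loss(z, ids)
loss.backward()
```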

Second, we consider unsupervised learning using the VICReg algorithm [27] that aims to map different views of the same datapoint to the same image embedding (Invariance) while maintaining unit variance in each dimension (Variance) and zero covariance (Covariance) across dimensions. The objective behind this is to maximize the representational capacity of the d dimensional output latent space by maximizing the amount of non-redundant information in each embedding dimension.
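
A sketch of the three VICReg terms for two batches of document embeddings is given below; the loss weights follow the defaults reported in [27] and are not tuned to our setting.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """Invariance, variance, and covariance terms for two views z_a, z_b of shape (B, d)."""
    B, d = z_a.shape
    sim = F.mse_loss(z_a, z_b)                                   # invariance: match the two views
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = F.relu(1 - std_a).mean() + F.relu(1 - std_b).mean()    # variance: keep unit std per dim
    z_a_c, z_b_c = z_a - z_a.mean(0), z_b - z_b.mean(0)
    cov_a = (z_a_c.T @ z_a_c) / (B - 1)
    cov_b = (z_b_c.T @ z_b_c) / (B - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = (off_diag(cov_a) ** 2).sum() / d + (off_diag(cov_b) ** 2).sum() / d  # decorrelate dims
    return sim_w * sim + var_w * var + cov_w * cov

loss = vicreg_loss(torch.randn(256, 64), torch.randn(256, 64))
```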

The third objective we consider is metric learning through the InfoNCE objective [28]. This objective function optimizes a probabilistic relaxation of nearest-neighbor retrieval: we maximize the likelihood of selecting the correct “nearest neighbor cluster” among all other examples by minimizing the negative log of the ratio of positive-pair similarity to overall pair similarity,

$$\begin{aligned} \mathcal {L}(x_1,\dots ,x_n) = -\log \frac{\sum _{x_i,x_j\in \text {positives}} \exp (\langle x_i,x_j\rangle )}{\sum _{x_i,x_j\in \text {all}} \exp (\langle x_i,x_j\rangle )}. \end{aligned}$$

This approach is well-suited for writer retrieval tasks as it encourages the model to capture discriminative features specific to each writer and also mirrors the discrete top-writer retrieval problem used during inference. Additionally, we propose a novel data augmentation strategy that involves generating synthetic writers by mixing existing writers. This augmentation technique aims to increase the diversity of the training data and improve the model’s ability to generalize.
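
To make the objective concrete, the following sketch implements the multi-positive InfoNCE loss above for a batch of document embeddings; the temperature parameter and the batch layout are illustrative additions, and the writer-mixing augmentation is not shown.

```python
import torch

def info_nce(embeddings, writer_ids, temperature=1.0):
    """For each anchor, positives are all other documents by the same writer;
    the loss is the negative log ratio of positive to total similarity mass."""
    z = torch.nn.functional.normalize(embeddings, dim=1)         # (B, d)
    sim = torch.exp(z @ z.T / temperature)                       # pairwise exp(<x_i, x_j>)
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, 0.0)                        # exclude self-similarity
    pos_mask = (writer_ids[:, None] == writer_ids[None, :]) & ~self_mask
    ratio = (sim * pos_mask).sum(1) / sim.sum(1)                 # per-anchor positive mass
    return -torch.log(ratio + 1e-8).mean()

# Example batch: two subsampled "views" per writer
ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = info_nce(torch.randn(6, 64), ids)
```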

Last but not least, we consider Triplet loss [28], where two different “views” of the same input (\(\text {anchor}\) and \(\text {positive}\)) get compared with a different input (\(\text {negative}\)).

$$\begin{aligned} \text {Triplet}(\text {positive},\text {negative},\text {anchor}) = \max \left( \Vert \text {positive}- \text {anchor}\Vert ^2_2 - \Vert \text {negative}- \text {anchor}\Vert ^2_2 + m,\, 0\right) , \end{aligned}$$

where m is the target margin between positive and negative inputs. This method is one of, if not the, oldest metric-learning methods, but empirically it still outperforms many newer methods [28].
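
A direct translation of this formula might look as follows (batched over triplets; PyTorch's built-in `nn.TripletMarginLoss` behaves similarly, up to the square on the distances). The margin value is illustrative.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||pos - anchor||^2 - ||neg - anchor||^2 + m, 0), averaged over the batch."""
    d_pos = ((positive - anchor) ** 2).sum(dim=1)
    d_neg = ((negative - anchor) ** 2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

a, p, n = (torch.randn(8, 64) for _ in range(3))   # two views of a writer plus a different writer
print(triplet_loss(a, p, n))
```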

One significant challenge encountered in our study is the memory constraint associated with transformers. Transformers require \(O(n^2)\) memory due to their self-attention mechanism, which becomes problematic when dealing with the arbitrary number of SIFT keypoints returned by the feature extraction process. To address this issue, we employ subsampling of the writer patches. Instead of presenting all patches, we randomly select a constant number of patches \(X_{\text {subsample}}\) for processing. Through empirical evaluation, we find that with roughly 2000 or more patches, this subsampling strategy has little to no impact on existing writer identification pipelines. Therefore, we reason that any superior aggregation strategy should at least match contemporary methods, even if only evaluated on subsets. To close the gap between the subsampled and full evaluation, we can utilize prediction ensembles that predict the final encoding as the mean over multiple subsets (though we found this to have a negligible effect for sufficiently many patches).
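
The subsampling and optional ensembling can be sketched as follows; here `model` stands in for the combined feature extractor and aggregator, and the patch budget and number of rounds are illustrative values.

```python
import torch

def subsample_patches(patches, n=2000, generator=None):
    """Randomly keep at most n of the document's patch crops (patches: (N, C, H, W))."""
    if len(patches) <= n:
        return patches
    idx = torch.randperm(len(patches), generator=generator)[:n]
    return patches[idx]

def ensemble_embedding(patches, model, n=2000, rounds=4):
    """Average the document embedding over several random patch subsets.
    `model` is assumed to map a set of patches to a single embedding vector."""
    with torch.no_grad():
        embs = [model(subsample_patches(patches, n)) for _ in range(rounds)]
    return torch.stack(embs).mean(0)
```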

This subsampling approach not only alleviates the memory challenges but also provides a natural way of constructing examples for the unsupervised learning method. Moreover, it serves as a form of data augmentation for the other objectives, enabling the model to encounter a diverse range of patch combinations.

This problem of the vast number of patches for some documents also leads us to discard masked image modeling approaches like BEiT [29] that create unsupervised objectives by randomly dropping image patches with the objective being to reconstruct the missing image patches based on the known ones. Due to the number of patches in some documents, we cannot afford to place all patches into one patch sequence and thus have to subsample. However, as soon as we do this, the reconstruction is no longer uniquely determined by the other patches and instead depends on the choice of our subsample \(X_{\text {subsample}}\). Since BEiT is arguably not even a state-of-the-art method in normal image classification tasks, we reason that the performance of BEiT should be worse or fall roughly in line with all other methods (assuming one could find a way to train it).

4 Benchmarking

For our experiments, we consider the challenging ICDAR17 [30] dataset, containing a test set of 3600 handwritten pages from the 13th to 20th century across 720 different writers, and a training set of 1182 pages across 394 writers. This dataset is known to be particularly challenging due to the high inter-class similarity and the defects present in historical documents. As the size of the retrieval dataset is of crucial importance when evaluating retrieval models, we not only report the overall performance on the entire test set but also showcase the behavior as we increase the number of candidates in the test set. We do this by first retrieving only the first 10 writers from the set of the first 10 writers, then the first 20 writers from the set of the first 20 writers, and so on, until we retrieve the entire dataset from itself (the normal test set configuration). This approximates the behavior one would expect on differently sized datasets and gives some information on the scaling characteristics of the models as the number of candidates increases.
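
The evaluation protocol can be summarized by the following sketch, which computes leave-one-out top-1 precision over growing writer subsets; the function and variable names are our own, and only precision is shown (mAP is computed analogously over the full ranking).

```python
import torch

def top1_precision(embeddings, writer_ids):
    """Leave-one-out top-1 precision: the nearest other document must share the writer."""
    z = torch.nn.functional.normalize(embeddings, dim=1)
    sim = z @ z.T
    sim.fill_diagonal_(float("-inf"))                  # never retrieve the query itself
    nearest = sim.argmax(dim=1)
    return (writer_ids[nearest] == writer_ids).float().mean().item()

def scaling_curve(embeddings, writer_ids, step=10):
    """Precision when restricting both queries and gallery to the first k writers."""
    unique_ids = writer_ids.unique(sorted=True)        # "first k" = k smallest (canonical) IDs
    curve = []
    for k in range(step, len(unique_ids) + 1, step):
        mask = (writer_ids[:, None] == unique_ids[:k][None, :]).any(dim=1)
        curve.append((k, top1_precision(embeddings[mask], writer_ids[mask])))
    return curve
```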

4.1 NetVLAD

Fig. 2 The progression of the test set precision for the NetVLAD architecture. The x-axis is the number of writers, while the y-axis shows the precision of retrieving “num writers” many writers from a “num writers” sized test set

We consider a NetVLAD [6] network with 100 clusters and a latent dimension of 64 on top of a ResNet-18 feature extractor. Looking at Figs. 2 and 3, we observe overall drops in performance as the number of writers increases. One curious artifact that will also occur in Sect. 4.2 is the sudden drop at around 50 writers, with a fast recovery at around 100 writers. We hypothesize that this drop is connected to how writer IDs are generated, e. g., mapping writers from similar sources to similar writer IDs; specifically, this drop vanishes if we choose a different order of writers. We choose to report this scaling study in the “canonical” order of writers, as it is only supposed to be a rough reference for how dataset size affects the performance of learned feature aggregators.

We do not perform any re-ranking or additional cleanup of the prediction results aside from a simple PCA-whitening on any of the tasks to showcase the turn-key performance of end-to-end methods.

Table 1 Summary of our results compared to prior work. As one can see, all methods are within roughly one percent in both mAP and precision, lending credence to an architectural limitation. All values are computed over the full test set

The main test set results can be found in Table 1, where one can observe significant drops in performance compared to both the state-of-the-art and the transformer implementation. Looking closer at the performance of the different methods, we can see that VICReg [27] is significantly ahead of the other methods in both precision (\(\approx 8\%\)) and mAP (\(\approx 4\%\)). This trend continues into the transformer-based architectures, just to a lesser degree. Looking at VICReg's objective function (see Sect. 3), this suggests that the other methods waste significant amounts of capacity in their latent codes, as VICReg is the only method that features explicit regularization of the embedding space with respect to embedding efficiency.

In general, the results of our end-to-end models still lose notably against the model built by [1]. We hypothesize that this is due to the larger hypothesis set present in end-to-end models: If one trains the feature aggregator jointly with the rest of the model, not only does one have additional freedom in how to aggregate information, but also in what information is being extracted in the first place. This is theoretically nice, but in practice additional regularization is needed to obtain a generalizable model. This hypothesis is strengthened by the fact that the model with the strongest regularization (VICReg) tends to perform best in both the transformer and NetVLAD regime. To our knowledge, VICReg is the strictest general-purpose self-supervised learning technique. One would presumably need additional constraints on the neural network to close the gap to, e. g., [1].

Fig. 3 The progression of the test set mAP on the NetVLAD architecture. The x-axis is the number of writers, while the y-axis shows the mAP of retrieving “num writers” many writers from a “num writers” sized test set

4.2 Transformer

Fig. 4 The progression of the test set precision for the transformer architecture. The x-axis is the number of writers, while the y-axis shows the precision of retrieving “num writers” many writers from a “num writers” sized test set

Fig. 5 The progression of the test set mAP on the transformer architecture. The x-axis is the number of writers, while the y-axis shows the mAP of retrieving “num writers” many writers from a “num writers” sized test set

The model used in all transformer-based tests utilizes three transformer layers on top of a ResNet-18 backbone. This size was chosen because, even when utilizing efficient attention mechanisms such as Flash Attention [31], the overall memory consumption prohibits larger models on 40 GB A100 GPUs. Qualitatively, even models with more than three layers that use gradient checkpointing [32] to save memory do not perform better; however, due to the large computational overhead introduced by gradient checkpointing, these models cannot be trained sufficiently. In general, if the transformer constituted an architectural edge, we would expect it to be visible from the get-go: even three transformer layers already consume far more compute than, e. g., simple average pooling, so even if we saw continual improvements with network depth, this would arguably not constitute a superior model, just one that trades off accuracy against computational cost differently.

Looking specifically at the mAP (Fig. 5) and precision (Fig. 4) plots, we can observe very similar behavior across all objective functions, suggesting a theoretical limitation rather than a limitation in the objective. It is worth noting that the results achieved here are not far off from the existing state-of-the-art: For example, [1] reports a precision of between 87.6% and 88.5%, depending on which ResNet backbone is chosen. This is close to our results (see Table 1) of \(\approx 83\%\) precision. However, the transformer-based method is still consistently weaker than the simple NetRVLAD [1] used by existing models.

The consistent improvement of all our methods as the number of writers grows suggests that the size of the dataset might be a contributing factor. We hypothesize that this trend exists because the model makes small errors that become less and less relevant to the expected precision at scale. This is contrary to what is usually expected with increasing retrieval sets, where we should see monotonic decreases in precision as the number of elements in the database increases, because the naive probability of accidentally retrieving the correct element drops as \(\frac{1}{|\text {items}|}\) when we increase the number of retrievable items. As previously mentioned, we hypothesize that this drop is connected to how writer IDs are generated, e. g., mapping writers from similar sources to similar writer IDs. Specifically, this nonlinear behavior vanishes if we choose a different order of writers.

It is also worth noting that every transformer is capable of simulating mean-pooling by simply setting

$$\begin{aligned} QK^T = \text {const} \end{aligned}$$
(2)

where each row of \(QK^T\) contains the same constant. The softmax operation then assigns all members of the row the same weight, which amounts to mean-pooling over V.
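
A quick numerical check of this observation, with illustrative sizes:

```python
import torch

# If every row of Q K^T is constant, the softmax yields uniform weights and
# attention(Q, K, V) reduces to repeating the mean of V in every row.
v = torch.randn(5, 8)
scores = torch.full((5, 5), 3.0)                 # constant pre-softmax scores
out = torch.softmax(scores, dim=-1) @ v
assert torch.allclose(out, v.mean(dim=0).expand_as(out), atol=1e-6)
```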

Further, we investigate the usage of ReZero [33], which scales every residual branch by an additional parameter \(\alpha \) initialized to 0, such that the skip connections

$$\begin{aligned} x_{d+1} = \alpha f(x_d) + x_d, \end{aligned}$$
(3)

amount to an identity transformation at initialization. This has advantages for training convergence but also trivially allows the model to “turn off” the transformer layers, in which case it falls back to mean-pooling. One would expect that, if mean-pooling were superior, the model would simply learn to keep \(\alpha \) low or zero, but this is not the case, even with weight decay on \(\alpha \). This suggests overfitting to the training set.
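
A minimal sketch of such a ReZero-style residual wrapper; the sublayer stands in for either the attention or the MLP block, and the wrapper is illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual branch scaled by a learnable alpha initialized to zero (Eq. (3))."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))   # identity mapping at initialization

    def forward(self, x):
        return x + self.alpha * self.sublayer(x)

block = ReZeroBlock(nn.Linear(64, 64))
y = block(torch.randn(10, 64))                      # equals the input at initialization
```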

In practice, we cannot observe any evidence of traditional overfitting, but such overfitting might be difficult to detect in the unsupervised and metric-learning cases, as both of these methods rely on metrics that are relative to the dataset, where the i.i.d. assumptions of regular deep learning are dropped. It is very much possible that, while the model does not overfit to any particular example, it overfits against the joint union of all training examples that make up the training retrieval set. Further, additional data augmentation, even on the objective level, as is the case with the patch-based augmentation of “mixed InfoNCE”, seems to have limited to no impact on overall performance.

4.3 Ablation study

We subject VICReg, InfoNCE, and Triplet loss to additional ablations, as we found these to produce more consistent estimators than cross-entropy, while the mixing variant of InfoNCE proved not as useful as hoped. Specifically, we consider model size as a function of depth, as well as data augmentation in the form of erosion and dilation, as these augmentations do not affect the writer's style (Tables 2, 3, 4).

Table 2 Ablation of different network designs using triplet loss
Table 3 Ablation of different network designs using VICReg loss
Table 4 Ablation of different network designs using InfoNCE loss

Aside from the already existing configuration with three transformer layers and no additional augmentation presented in Table 1, we specifically study transformers with one or two layers. For VICReg (Table 3), all of these configurations lose performance, with the smallest model being the closest to the three-layer variant. Erosion and dilation augmentations on the three-layer model similarly reduce performance. In both cases, we consider a \(3\times 3\) structuring element with an application probability of \(10\%\). While neither augmentation improves performance, the erosion variant is significantly more detrimental to overall performance.

Expanding our view to other loss types, we find similar patterns with dilation augmentation dropping the least amount of performance when compared to the “optimal” configuration presented in Table 1. Erosion is consistently the most harmful to overall performance with VICReg being the most affected by erosion augmentation.

Looking at the relative behavior of the different loss types, one can observe that Triplet loss seems to be the most stable against architectural changes, dropping very little performance compared to even the best models in our “optimal” configuration. Of course, one also has to note that Triplet loss is, in general, the worst-performing of all our approaches, so in some sense one trades off performance against robustness when choosing Triplet loss for this application. One further aspect in favor of Triplet loss is the fact that it is the cheapest method to train, with VICReg being the most expensive (VICReg \(\approx 2\hbox {min}\) per epoch, while Triplet \(\approx 1\hbox {min}\) per epoch). It might be that one needs to increase the amount of compute spent on Triplet loss to match the other methods (simply increasing the training time produced no improvement in performance).

5 Conclusion

In this study, we have rigorously explored the performance of smart aggregation methods in the context of writer retrieval. Our findings reveal a consistent and intriguing trend: regardless of the architecture, hyperparameters, training strategies, or augmentation techniques employed, learned aggregation schemes fail to outperform contemporary mean-pooling strategies.

This observation raises a fundamental question: why do these sophisticated, learned aggregation methods consistently fall short of the straightforward mean-pooling approach? We hypothesize that the answer lies in the additional degrees of freedom introduced by the learned methods.

The mean-pooling strategy, though rudimentary, offers a certain level of consistency and robustness. It computes a straightforward average of patch embeddings, providing a stable representation of the document's content that is also independent of the overall document layout. In contrast, learned aggregation methods introduce complexity by attempting to capture intricate relationships between patches. While this can be advantageous for high-quality features, in practice the additional cross-correlation between patches may lead to overall worse representations, as unwanted artifacts (such as document layout) may perturb the writer's style information.

Furthermore, learned aggregation methods introduce additional hyperparameters and choices that must be tuned to achieve optimal performance. These choices, including architecture selection, learning rates, and loss functions, add an extra layer of complexity to the training process. This is a natural drawback of extending the scope of learned solutions and may prove a general barrier toward more complex scenarios: Writer identification may be a sufficiently well-understood problem where learning simply isn’t necessary, and existing classical writer retrieval techniques, informed by the problem structure, outperform learning-based methods. Specifically, the writer retrieval task itself may not inherently benefit from the sophisticated relationships that learned aggregation methods aim to capture. The simplicity of mean-pooling might align better with the nature of the task, where the goal is to identify writers based on their distinctive styles rather than complex inter-patch relationships. A final reason for the underperformance might also be a simple lack of data. Existing datasets that treat the problem as a dataset of size \({\text {num}}\_{\text {images}}\times {\text {patches}}\_{\text {per}}\_{\text {image}}\) might not have enough data to justify a learning-based feature aggregator that only has access to the effectively smaller \({\text {num}}\_{\text {images}}\)-sized dataset.

However, there are also significant avenues for further research into learned feature aggregators: One aspect is the utilization of constraints on how a model is allowed to aggregate information. This would counteract the relative decrease in dataset size and reduce overfitting. Another aspect is tuning the patch-level feature extractors specifically for learned aggregators: With existing models it is quite easy for artifacts outside writer information (e. g., the specific characters that were written) to sneak into the representation, which makes it easier for patch aggregators to fit unwanted data; a more focused feature extractor may help prevent this. It might also come down to problem scale: Perhaps the same models would perform well on vastly larger corpora, which currently do not exist. Especially fully unsupervised training in the VICReg style might be interesting, as it does not need writer ID labels. Finally, imitation-learning-like techniques might be useful for learned feature aggregation: In this case, the existing averaging-based writer retrievers could act as a reference signal, with learned feature aggregators trying to maximize their performance while being constrained to remain similar to existing models, i. e., \(\min L(f_{\text {learned aggregator}}(x)) + {\text {KL}}(f_{\text {learned aggregator}}(x),g_{\text {reference}}(x))\). This variational objective might prevent overfitting as well. In general, further research is needed to overcome the current limitations of learned feature aggregators.

In conclusion, our study suggests the intriguing yet counterintuitive result that, in the domain of writer retrieval, smart aggregation methods do not outperform the simple mean-pooling strategy.