Classification of incunable glyphs and out-of-distribution detection with joint energy-based models

Optical character recognition (OCR) has proved a powerful tool for the digital analysis of printed historical documents. However, its ability to localize and identify individual glyphs is challenged by the tremendous variety in historical type design, the physicality of the printing process, and the state of conservation. We propose to mitigate these problems by a downstream fine-tuning step that corrects for pathological and undesirable extraction results. We implement this idea by using a joint energy-based model which classifies individual glyphs and simultaneously prunes potential out-of-distribution (OOD) samples like rubrications, initials, or ligatures. During model training, we introduce specific margins in the energy spectrum that aid this separation and explore the glyph distribution’s typical set to stabilize the optimization procedure. We observe strong classification at 0.972 AUPRC across 42 lower- and uppercase glyph types on a challenging digital reproduction of Johannes Balbus’ Catholicon, matching the performance of purely discriminative methods. At the same time, we achieve OOD detection rates of 0.989 AUPRC and 0.946 AUPRC for OOD ‘clutter’ and ‘ligatures’ which substantially improves upon recently proposed OOD detection techniques. The proposed approach can be easily integrated into the postprocessing phase of current OCR to aid reproduction and shape analysis research.


Introduction
The invention of printing with movable type is understood by many as the single most important innovation of the European Middle Ages. Since its introduction, letterpress printing has redefined the way we gather, process, and hand down information and knowledge from generation to generation. To analyze the contents of the earliest printed books, called incunables or incunabula, digital analysis techniques have become a valuable tool, taking their place alongside the bibliographical and photographic methods used by librarians and book scholars alike. Today, optical text and character Fig. 1 Overview of our method. A joint energy-based model (JEM) is used for glyph classification and separation of in-distribution (ID) and out-of-distribution (OOD) data recognition (OCR) is widely used to extract individual text components automatically or to make complete transcriptions of digital scans [21].
While modern OCR techniques build on advances in Deep Learning and really excel in transcribing printed media, their use in localizing and identifying the most elemental graphic units of text-so-called glyphs-often falls short of expectations. By design, state-of-the-art OCR analyzes lines of text to capture long-range semantic relations, making do with a rather crude localization in the process. At the same time, detecting glyphs with older rule-based algorithms [59] often fails to generalize confronted with uncommon font groups and it is also surprisingly susceptible to image degradation such as smears and smudges or other pathological residuals such as painted initials, lombards, rubrication, and handwritten marginalia. Moreover, frequently occurring sequences of adjacent characters are often represented in early printing as a ligature, using a single glyph printed with a compound type. Despite subtle differences in visual appearance and general shape, the contained graphemes are mostly not separated, either intentionally or due to the OCR approach's inability to do so. This behavior is undesirable for a downstream task that seeks to analyze and compare intricate details of glyph shape occurring in different printed editions and copies.
In this paper, we will show that these issues can be mitigated by extending the OCR-based glyph extraction with a subsequent classification and out-of-distribution (OOD) detection stage (Fig. 1). Such finetuning is an attractive solution that allows us to take advantage of the powerful features of current OCR methods, such as the enormous amounts of available transcription ground truth, the availability of optimized pre-processing steps, and the ease of generalization.
To do this, we use a joint energy-based model (JEM) [28] to estimate the true probability distribution of the target glyph images and classify the OCR-based image patches into a single character class. We then exploit the distribution estimate to separate in-distribution (ID) and OOD samples of 'clutter' images (e. g., empty spaces, rubrications, initials) or ligatures. Through a comprehensive series of experiments on a digital reproduction of an incunable, we observe strong classification and OOD detection performance, surpassing other state-of-the-art inference-time detection strategies. We further demonstrate the method's robustness when exposed to images with different types of degradation frequently observed in scannings of printed historical documents.
Our contributions to digital document analysis focus on the following areas: • We show that classifying glyphs with a hybrid model matches the performance of purely discriminative approaches, despite being additionally tasked with density estimation. • We show that enforcing a margin between ID and OOD energies during the training of JEMs substantially increases OOD detection performance for image clutter and separation between single characters and ligatures. • We introduce an effective extension to JEM's generative sampling that uses information about the typical set's geometry, i. e., regions with high contribution to the expectation, to stabilize the optimization process.

Detection of glyphs and symbols on historical prints
Detecting and localizing glyphs or symbols on historical prints for reproduction or fine-grained analysis is mostly coupled to the more general efforts and algorithmic developments in OCR [21]. Early approaches typically rely on geometric rule-sets [14], classical shape descriptors [48], or carefully crafted features in a machine learning approach [73] to locate and categorize recurring symbols. Subsequent methods use different variants of template matching [6,7,9,25], employ graph matching [13], or follow the advances in deep learning with a particular focus on few-shot reference learning [61], object detection [8], or probabilistic generative modeling for generalized glyph shape assessment [26].

Energy-based models
The concept of energy-based models (EBM) dates back to early research on the collective properties of neural networks [35] and Boltzmann machines [1] before being set forth as a distinct approach by LeCun et al. [41]. Since then, EBMs have become an essential member of the family of deep probabilistic and generative networks [4]. They are used for implicit generative modeling [20,54], predictive modeling [28], continual learning [44], iterative reasoning [18], compositional learning [16,17], and OOD detection [22,65]. They also have been used to re-interpret the workings of other approaches such as Variational Autoencoders [69] or Generative Adversarial Networks [11]. We refer the reader to the very instructive works by Du et al. [20], Song et al. [60], Osogami [55], and Arbel et al. [2].

OOD detection
OOD samples typically stem from a secondary distribution that differs from ID data by a semantic shift in the label space or a covariate shift. Yang et al. [71] provide a detailed review on OOD detection methods and their commonality with open set/open world recognition [39]. In the context of predictive modeling, popular OOD detection directly rely on the softmax confidence scores [31,45], evaluate the distance of the sample to the nearest class-specific distribution [42], or utilize energy scores computed on the classifier's logits [46]. Recent methods build upon the observation that ID data tends to show higher vector norms in gradient space compared to OOD data [36], or evaluate the k-nearest neighbor distances between feature vectors of the training set and an inference sample [63]. Furthermore, some early works explore the idea of the typical set for OOD detection [12,50].

Theoretical background
In this section, we begin by introducing the general concept of EBMs and how they can be trained for modeling some probability density. After that, we will show how this notion can be used within conventional classification tasks using JEMs [28]. Following that, we will explain how these hybrid models can be used to detect OOD samples and how we can aid this goal even further by actively shaping the energy surface. Lastly, we will propose a correction for a more stable optimization of JEMs based on the concept of typical sets.

Energy-based models
Given a limited number of observations x ∈ R D , we often want to explain the underlying distribution of that data by constructing an estimate q θ (x) of the true probability density function p(x), ideally using as few assumptions about the data as possible.
Energy-based models seek to establish a relation between these probabilities and certain energy states using a parameterized energy function [41]. This relation is expressed as a Boltzmann distribution q θ (x) = exp(−E θ (x)) Z θ , where E θ (x) : R D → R is the energy function with parameters θ which maps a point of a D-dimensional input space to a single scalar state, called the energy [37]. The partition function Z θ = x exp (−E θ (x)) dx normalizes the energy magnitudes with respect to x to ensure that the fundamental properties of a probability distribution are satisfied. If the parameters θ are configured well, realistic observations are assigned low energy values, and unlikely observations correspond to high-energy states. Intriguingly, the choice for E θ (x) places no other restrictions on the function type and can be efficiently parameterized by a neural network with weights θ .
For most scenarios, optimizing these weights with conventional maximum likelihood learning on some training data X train is not straight-forward as evaluating the normalization constant Z θ (which is required to assess the likelihood of x) is either analytically intractable or requires time-consuming numerical integration. On the other hand, only estimating unnormalized densities by maximizing exp (−E θ (x)) is not feasible since the likelihood of the training data changes with different values for Z θ .
Fortunately, even though we cannot estimate the normalized densities directly, we can use Contrastive Divergence (CD) to approximate the gradients of the energy function [10,33,34,67]. More formally, the partial derivative of the negative log-likelihood objective − log q θ (x) with respect to weights θ can be expressed as This contrastive update scheme increases the likelihood of samples from the true distribution p(x) while reducing the likelihood of samples from the generated distribution q θ (x), eventually reaching an optimum at the equilibrium p(x) = q θ (x). Now, instead of evaluating the integration of Z (θ ), we approximate it numerically by taking a sample from the model distribution q θ (x). Although we are still not able to analytically sample from q θ (x), we can set up a Markov Chain that stochastically explores the model distribution and eventually converges to the equilibrium/steady-state distribution p(x) using K cycles of Monte Carlo sampling (MCMC). This iterative transition exploits the ratio between two evaluated points q θ (x) and q θ (x ), which allows to cancel out the partition function (cf. [68]).
To reduce very long mixing times of the Markov chain, we can use an alternative sampler in continuous space state based on Stochastic Gradient Langevin Dynamics (SGLD) [5,41,53,67,70]: where π(x) is a prior distribution from which a sample can easily be drawn (e. g., a Gaussian distribution), η is the step size, and is an additive Gaussian noise. This update equation adapts a random initial noise by evaluating the gradients of the modeled probability density and behaves like gradient descent toward lower energy values.

Classification with joint energy-based models
We recall that EBMs estimate the probability density q θ (x) over the observation space with a high likelihood for data that are part of X train . In contrast, predictive classification models want to estimate the conditional distribution q θ (y|x) for data points x ∈ R D and labels y ∈ R over a number of distinct classes C. In practice, this distribution is most often estimated using a parametric mapping f θ (x) : R D → R C that assigns real numbers (logits) for all classes to each data point. The logits are subsequently normalized to obtain the categorical distribution as where exp( f θ (x) [y]) is the y th logit. As shown by Grathwohl et al. [28], both these concepts can be elegantly combined by interpreting the classifier's logits as unnormalized densities of the joint distribution q θ (x, y) = exp( f θ (x)[y])/Z (θ ). In other words, we can use the logits of a discriminative classifier to define an EBM using the relation The marginal distribution over x then defines a density model over the data x as Training f θ with respect to the joint distribution can be efficiently done using the factorization log q θ (x, y) = log q θ (x) + log q θ (y|x) [28,56]. This allows us to optimize the categorical distribution q θ (y|x) using the cross-entropy objective function and to optimize the densities q θ (x) with CD. The resulting hybrid model (cf. [49]) with generative and predictive components is known as a Joint Energy-based Model (JEM) [28].
Within SGLD, we can then draw samples from q θ in two ways: (1) Computing gradients with respect to log y exp( f θ (x)[y])), which corresponds to unconditional sampling. (2) Computing gradients with respect to log exp( f θ (x)[y]), which allows class-conditional sampling.

Incorporating out-of-distribution data with energy margins
One characteristic trait of EBMs is their ability to assign substantially different energy states to ID data and data suffering from semantic or covariate shift, more generally called OOD data. Thus, EBMs implicitly are competitive classifiers to distinguish between ID and OOD data. Critically, however, optimization with CD does not enforce steady energy states for data sampled from p(x) and q θ (x) as it is unaffected by any additive bias on the energy function's output. The energy magnitudes fluctuate across the CD iterations and might drift into numerically unstable ranges, which negatively affects the contrastive shape of the energy surface by introducing steep gradients and making it increasingly non-smooth. This process leads to overall worse approximations to the expectation of the energy w. r. t. the model distribution, such that finding an appropriate discrimination threshold for OOD detection becomes very difficult.
An intuitive and simple concept to counteract this behavior is to establish explicit energy levels for p(x) and q θ (x) during CD. Based on the idea of Du et al. [20], we add L2 regularization to constrain the Lipshitz constant, i. e., the indistribution loss becomes: with x + ∼ p(x) and x − ∼ q θ (x). This regularization term ultimately enforces an energy around E θ (x + ) ≈ 0. Because of the remaining approximation bias in q θ (x), the energy for x − typically ends up being a little bit higher.
In addition, we are often interested in specific scenarios where we know well in advance certain characteristics of OOD data that might interact with our system. This information is particularly valuable when the characteristics relevant for distinguishing between ID and OOD data differ only by nuances. We can assist the contrastive optimization scheme by explicitly introducing known (and task-specific) samples for which we enforce a high energy level.
Let us consider a secondary dataset X OOD with known OOD examples x * ∈ R D . For each CD iteration, we can draw a random sample from this dataset and penalize the assigned energy value E θ (x * ) if it is below a certain margin τ ∈ R. We build here on the idea of Liu et al. [46] and  [46] implement this penalty term using a squared Hinge loss as The effect of this energy regularization for OOD data is illustrated in Fig. 2.

Introducing probability mass to SGLD sampling
A defining concept in CD is to numerically approximate the expectation of the target probability density using an MCMC estimator. To find a good estimate in a computationally efficient manner, we must make sure to focus on regions in the sample space that contribute significantly to the expectation [3]. An intuitive strategy for sampling the model distribution is to explore regions around the mode where we observe high density. This objective agrees well with the concept of maximum likelihood models but is unable to take into account the geometric composition of high-dimensional parametric spaces. Specifically, regions of high probability density lie in a compact neighborhood around the mode but constitute only a small volume. On the other hand, regions of high volume but low density are located toward the tails of the distribution [3]. This effect strengthens and consolidates with increasing dimensionality, such that explorations of the sample space that are based on high-density sampling inevitably ignore important parts of the contributions to the expectation. Note that the strength of this contribution at any point is determined by the product of density q θ (x) and volume dx, known as the probability mass. Consequently, in order to locate the areas with the most significant contributions to the expectation, we have to carefully consider both the density and the volume. The region where the highest mass concentrates is located between the two extreme points and is generally known as the typical set [3,12,50].
This notion of a typical set plays a decisive role when modeling Markov chains. For a well-structured target distribution that is preserved under some Markov transitions, the Markov chain first navigates toward the typical set and then explores it further to reduce the strong initial bias until consolidation. Unfortunately, this process may be disrupted as the Markov By querying information about the structure of the typical set at each MCMC cycle, we can inform the Markov transitions about areas with locally high contributions to the expectation. For finite Markov chains, using only information about the probability density leads to slower mixing times and thus may result in more biased samplers transitions are limited in their ability to effectively probe certain types of regions, e. g., high-curvature non-smooth neighborhoods which are very common in practice. For low mixing times, this leads to heavily biased Monte Carlo estimators. With longer mixing times, the Markov chain can sporadically explore these regions, which, however, leads to substantial oscillation in expectation. As a result, running a finite Markov chain in presence of such pathological areas most likely leads to biased and sub-optimal estimators which never reach the true distribution.
It can therefore be advantageous to calibrate the Markov transitions to the geometry of the typical set by querying the target distribution. We propose a lightweight and efficient implementation of this idea by introducing a correction term to the SGLD update equation [Eq. (2)]. This correction harnesses information about the typical set's geometry at each update step to prioritize Markov states that maintain a certain level of probability mass (Fig. 3).
As a notion for the probability mass at a point, q θ (x)dx, we use a differential approximation presented by Gratwohl et al. [28] using gradients of the log-density/energy function. As an approximation, the Euclidean norm of these gradients should be small for areas of high mass. Formally, this relation can be written as follows: where This correction term can then be used during SGLD update to steer the Markov transitions along a trajectory with sufficient probability mass toward the typical set: γ ∈ [0, 1] is a relative weighting factor that can be tuned to find a balance between high-density and high-mass sampling. In the following, we call this concept mass-corrected sampling (MCS).

Final objective function
Putting it all together, we arrive at the following objective for JEM training: L ML marks the CD policy [Eq. (1)] to optimize the densities q θ (x) with the standard biased SGLD sampler according to Eq. (2), or the mass-corrected sampler according to Eq. (9). L CLS with weight α ∈ R marks a cross-entropy objective used to optimize the predictive component q θ (y|x). L ID with weight β ∈ R corresponds to the L2 regularization given in Eq. (5), and L OOD with weight λ ∈ R is the energy margin penalization for OOD data given by Eq. (6). Unless stated differently, we train the models using Eq. (10) with weights α = 1, β = 0.1, λ = 0, and default SGLD (Eq. (9)) with γ = 1.

Application to glyph classification
In this section, we describe the data to which we want to apply the algorithms presented in the previous section. First, we specify the selection and preparation of glyphs from an incunable edition. We follow that by describing two OOD datasets that cover frequently observed pathological residuals in incunables which are often falsely detected by OCR frameworks. Finally, we provide details on data preprocessing, model implementation, and the metrics we use for method evaluation.

Incunable glyph data
In order to obtain a large corpus of incunable glyphs, we followed a two-step process. First, we extracted possible candidates and an estimate of their associated class using an OCR engine trained on a large corpus of synthetic text lines with Latin characters 1 [64]. This step was followed by careful manual selection and curation for each class by a domain expert.
In particular, we set up a pre-processing and glyph extraction pipeline with OCR-D [51] which included the following operations in sequential order: (1) binarization, (2) cropping, (3) denoising, (4) region segmentation, (5) combination of rule-based (glyph-level) and LSTM-based OCR (line-level) with a language model that was pre-trained on synthetic Latin prints in mostly Antiqua and Fraktur font, 2 (6) glyph extraction based on position decoding Connectionist Temporal Classification (CTC) (the detailed workflow is given in Table 4).
All glyph candidates were then categorized based on the most probable grapheme as identified by the OCR engine.
We used this pipeline on high-resolution scans of the first 71 pages of a digital reproduction of Johannes Balbus' Catholicon from the copy held by the John Rylands Research Institute and Library Incunable Collection (shelfmark: 3383) (cf. ISTC ib00020000 [38], GW03182 [24]). This edition of the Catholicon was presumably printed in 1460 using a Gotico-Antiqua type. The number of extracted candidates over all characters defined in the language model was 345,152.
After extraction, we defined 42 classes covering the most prevalent lower-and uppercase characters according to a manually curated glyph table [66]. For each class, all described allographs of the character apart from typographic ligatures were considered. Following that, up to 1,000 random samples from each class (if applicable) were selected before being reviewed and pruned by an expert in typographic printing and historical book studies. The cleaned corpus contains 16,171 images distributed across all classes, split into train/validation/test subsets according to a 70 %/15 %/15 % ratio (11,319/2426/2426 images).

Out-of-distribution data
We optimize and evaluate the OOD detection performance on three auxiliary datasets. Each set contains samples with selected characteristics that lead to an undesirable but plausible shift in the true distribution p(x). The selection of these characteristics was based on the expertise of book scholars, their analyses of OCR extraction results, and their definition of undesirable patterns in subsequent reproduction.
(1) Clutter We identify frequent false-positive candidates from the OCR extraction pipeline and combine them in a single OOD dataset.

Data pre-processing
All images were converted from RGB to grayscale and were padded with zeros according to the largest edge in height and width dimensions present in the incunable glyph data. The smaller image dimension was then reshaped to 56 px while matching the larger dimension according to the original aspect ratio. Lastly, we center-cropped the images to a uniform spatial resolution of 56 × 56 px, facilitating minibatch training. Importantly, these steps maintain the relative scale between the image samples.

Neural network architecture and training protocol
We use a lightweight Convolutional Neural Network (CNN) for all experiments and ablation studies. The network topology is provided in Appendix Table 5.
We use a maximum epoch size of 120, an Adam optimizer with a default learning rate of 0.0001 with exponential decay every second epoch with a factor of 0.97, and a batch size of 32 across all datasets. We further configure the biased SGLD sampler with K = 60 MCMC cycles, a large learning rate η = 20, and a reduced amount of injected noise with σ = 0.005, promising faster convergence in scenarios with little cycles K albeit straining the balance between the sample variance during stochastic gradient descent (early mixing) and the variance during posterior sampling via SLGD (late mixing) [67].
In addition, we use the concept of reservoir sampling/replay buffer [19,20] to reduce the Markov Chain mixing times by using past samples to initialize SGLD sampling. For each mini-batch, we use 95 % buffer images and 5 % new Gaussian noise samples.

Evaluation metrics
Classification We evaluate the classification performance using two threshold-independent metrics: (1) we compute the area under the precision-recall curve (AUPRC) using a one-vs-rest approach, and (2) we evaluate the area under the receiver operating characteristic curve (AUROC). We also report the expected calibration error (ECE) [29] to measure the alignment between the predicted softmax probabilities and the classification accuracy. OOD detection We evaluate the OOD detection performance by interpreting it as a binary classification problem between ID data (true class with label 1) and OOD data (false class with label 0). We report the AUPRC and AUROC (trapezoidal rule), as well as the false-positive rate at a true-positive rate of 95 % (FPR95).

Experimental results
In this section, we evaluate JEMs and the presented extensions on the glyph dataset. We start by comparing its classification performance with that of a purely discriminative approach. Then, we evaluate the model's ability to detect OOD samples and examine if learning a margin between ID and OOD data in the energy spectrum can improve this performance. Furthermore, we analyze the correction term for SGLD concerning the typical set and study its impact on training stability and general performance. Finally, we assess the algorithm's susceptibility to several types of image degradation and inspect generative samples of the typical set for a subset of the classes.

Glyph classification with JEMs
We start by comparing the classification performance between a discriminative baseline model (CNN) and unconditional and class-conditional JEMs in default configuration (cf. Sect. 3.5). The results are summarized in Table 1. The best results per metric are marked in bold ↑Marks 'higher is better' ↓Marks 'lower is better' * Model did not reach convergence †Model converged with mult. learning rate decay every forth epoch The discriminative model (CNNs) shows a strong classification performance with an AUPRC of 0.972 and excellent confidence calibration. Despite being additionally tasked with density estimation, the JEM with unconditional sampling even improves beyond this predictive model with a minor increase in AUPRC, albeit at a higher calibration error. As can be observed in Fig. 4, the most common characteristics among the wrongly assigned images include margin crops, lack of centering, label noise (ligatures, rubrications), or partial under-inking which leads to unfavorable similarity to other classes. Moreover, underrepresented classes at the long tail suffer from diffuse predictive probabilities, so that assignments to similar classes are very likely (e. g., 'k' → 'h,' 'G' → 'E').
Interestingly, class-conditional sampling (JEM-CC) leads to noticeably worse results with a difference of one percentage point to the unconditional variant (Table 1). By design, class-conditional sampling forces the Markov Chain to concentrate into a distinct and more singular area where the probability mass for the specific target class is high. Given that we start the Markov Chain with a random initial point in the sample space, this region may be distant or may require a trajectory across or out of regions with high contribution to the expectation of visually very similar classes. Although this confounding effect is heavily reduced by using a classspecific sampling reservoir (as we initialize the sampling at nearby points), it could be even further alleviated by increasing the Markov Chain mixing time to compensate for the longer or more complex trajectories.

OOD detection of clutter and ligatures
In this section, we compare the OOD detection performance of the hybrid JEM models with different discriminative inference-time detection methods. In particular, we evaluate different choices of the score function s θ (x) ∈ R for ID and OOD samples and evaluate the separability of the two resulting score distributions (cf. Appendix Table 6).
We consider three OOD detection strategies that make use of the classifiers' softmax scores. The baseline method by Hendrycks et al. [31] uses the magnitude of the classifier's highest softmax score, which we refer to as predictive probability. Intuitively, ID images should be assigned a single high class score close to one, whereas OOD samples are expected to yield a rather diffuse probability, ideally approximating a uniform probability over all classes. In ODIN [45], the authors build on this idea and use temperature scaling and input perturbations based on the image gradients to increase calibration of the prediction confidence (cf. [29]), allowing to push apart the predictive probabilities between ID and OOD image samples. While ODIN exploits the gradient domain only implicitly to calculate perturbations, GradNorm by Huang et al. [36] directly derives scores from the gradient space by calculating the vector norm of the gradients backpropagated from the Kullback-Leibler divergence between the softmax output and a uniform distribution. We observed that the scores obtained with this approach were significantly higher for ID data than for OOD data. Moreover, we evaluate score functions based on the free energy score across all class logits [46], as well as on the Mahalanobis distance of a sample to the nearest class-conditional Gaussian distribution using a notion of an induced generative classifier under Gaussian discriminant analysis (GDA) [42]. This strong assumption of the class-conditional distribution in the feature space was pointed out by Sun et al. [63] as a potential weakness. They proposed to use a non-parametric estimate using k-th nearest neighbor (KNN) distances between the trained ID embeddings and the embedding of the input sample. For the hybrid models, we use again predictive probability, which performed better than energy-based scores based on probability density and mass during an initial analysis on the OOD datasets' validation splits (cf. Appendix Table 7).
The results in Table 2 reveal that most OOD detection methods reliably separate the clutter images from ID data. Surprisingly, applying GradNorm leads to a much We used the predictive probability score function (cf. Table 6) to evaluate the OOD performance of the hybrid models Best models for each metric are marked in bold worse separation of the two distributions, suggesting that the differences in gradient magnitude are not sufficiently discriminative. We hypothesize that the inter-class differences in the glyph are smaller than those in the ID datasets considered in the original work, which might lead to less informative gradient norms. The KNN method yields the best results by a large margin. In comparison, all discriminative methods yield noticeably worse detection rates for the OOD ligature data. Interestingly, an additional calibration of the prediction confidence does not aid OOD detection. For example, ODIN utilizes perturbations of the input to increase the maximum softmax score of the ID samples. As we already observe exceptionally high scores for the ID data at well-calibrated confidence (cf. Table 1), the effect is seemingly negligible. It should also be noted that neural networks in many scenarios tend to produce overconfident predictions [30,32,43,47,52]. These scores can also be disproportionately high for OOD data, even if the data is far away from the softmax decision boundary [42], which curtails the effectiveness of ODIN further.
Using the Mahalanobis distance score significantly improves the ligature detection rate in terms of AUPRC and AUROC scores compared to the predictive probability baseline. At the same time, the FPR at a fixed threshold of 95 % TPR is shown to be worse than the predictive probability baseline, highlighting the importance of adequate threshold selection. Because the class-conditional distributions relevant for calculating the distance-based scores are derived using features from different receptive fields, we argue that both local details and global context are important to capture the intricate differences between stand-alone and linked allographs. For example, ligatures starting with the character 'f' might connect to the adjacent character by a slight extension of the ascender (e. g., 'l'), the crossbar (e. g., 'o'), or might only show an infiltration into the baseline region without a continuous ink trace.
Similar to the observations on the clutter data, the KNNbased OOD detection yields very competitive AUROC and FPR95 scores that substantially improve over the predictive probability baseline. However, while this tells us that KNN produces very few false positives (ID samples assigned to the OOD distribution) given a high TPR, the low AUPRC indicates that we still obtain a relatively large absolute count of false positives. Furthermore, KNN explicitly uses only feature embeddings from the penultimate layer to the nearest neighbor distances. For that reason and based on the observation of the Mahalanobis distance score, it would be interesting to investigate whether a weighted addition of earlier network layers with a stronger local context affects detection quality.
The unconditional JEM without any further modifications matches the OOD detection rate of the best-performing discriminative models for ligatures and clutter images. In comparison, actuating the trend in classification results, the class-conditional JEM achieves worse results than the unconditional variant, which is particularly evident in the poorer detection of ligatures. The four percentage point difference in AUPRC performance suggests that it is useful to capture overarching features across all classes, such as often occurring linking patterns between adjacent characters.

Effects of an energy margin between ID and OOD data
In this section, we introduce energy margins between ID and OOD data with an additional Hinge loss term (cf. Sect. 3.3) and evaluate its influence on classification and OOD detection. We begin by exploring the hyper-parameter space for different combinations of the energy margin τ ∈ {3, 6, 9} and   Fig. 8). The bestperforming configurations were found to be (τ = 3; λ = 0.03) for JEM and (τ = 3; λ = 0.01) for JEM-CC. We observed that the class-conditional variant suffers from a diverging CD training loss with eventual breakout for almost all runs with higher margins or weights. While this phenomenon affects the classification abilities only marginally, it impairs the OOD detection on both OOD datasets and leads to unstable training objectives. We now apply the models with a tuned energy margin on the test set ( Table 3). The explicit separation of clutter images yields only minor improvements to the OOD detection scores as these images are already assigned a sufficiently large energy value by default. This can be explained by a large difference in observed characteristics between members of the target classes and the considered types of clutter. Conversely, enforcing a margin between ID data and ligature images shows over 2 percentage points improvement in OOD detection AUPRC and reduces the FPR95 by 17.2%.
This detection performance is also notably better than that of the KNN method, especially in the absolute number of false positive assignments to the OOD distribution. Here, providing the network with hard OOD samples to shape the energy surface in between ID and OOD data distributions helps to pinpoint intricate visual differences despite the strong semantic similarity. For both unconditional and class-conditional modeling, the classification performance slightly drops when certain energy states are enforced.
A particularly interesting observation can be made for the models that are trained on the combined OOD data pool. Although the OOD detection rates can be improved for both clutter and ligatures, the magnitude of the improvement falls short of the specialized training variants. This effect suggests that different classes of OOD data should not be assigned to the same region of the energy spectrum but rather be treated separately. In this context, future work could investigate into treating different OOD data as additional classes during training of the predictive component.

Effects of using probability mass in SGLD
We continue by introducing a notion of probability mass to the SGLD update equation (i. e., mass-corrected sampling) Fig. 5 Contrastive divergence training loss curves illustrating the mediating effect of mass-corrected sampling (MCS) for class-conditional JEMs. Larger OOD energy margins τ or weights λ require increasingly sharp energy gradients for class separation, which can lead to unstable gradient-based optimization and irrecoverable estimator bias and analyze its effects on training stability and sampling quality (cf. Sect. 3.4). The results are summarized in Table 3.
In general, MCS shows no effect when it is used to train unconditional models. For class-conditional sampling, however, MCS leads to a significant improvement in discriminating between ID images and ligatures with 4.7 percentage points in AUPRC. This trend is reinforced when MCS is used together with an explicit ID/OOD energy margin, albeit at the cost of a noticeable decrease in classification accuracy. From a conceptional perspective, training with a class-conditional sampler requires a comparably sharper separation of classspecific densities in sample space. This effect makes it difficult to fulfill the contrastive update scheme, which builds on the idea of increasing the likelihood of a sample given a specific class while at the same time pulling down the likelihood of samples from the model distribution and known OOD samples. We argue that in such a scenario, MCS helps to stabilize the SGLD update trajectory in more pathological regions by favoring Markov transitions that do not move away too far from the typical set [3].
The visualization of the CD training loss in Fig. 5 shows that this stabilizing effect prevents loss breakouts and the subsequent divergence of the optimization process. This is particularly true for more aggressive hyper-parameter selections with large energy margins τ and weighting factors λ.

Effects of image degradations
This section examines the hybrid model's behavior when exposed to images affected by different degradations frequently observed on historical printed documents. Specifically, we are interested in the impact of these effects on classification and OOD detection performance. die häufige Probleme bei historischen Drucken abbilden. We consider six different image augmentations which are applied and evaluated in an isolated manner. The results for each augmentation type are visualized in Fig. 6. With increasing additive Gaussian noise up to a standard deviation of 0.75 px, the classification performance and OOD detection rate remain almost constant. For more aggressive noise configurations, the augmented images increasingly resemble the initial Markov states in SGLD sampling which correspond to very high energies. Furthermore, decreasing the image resolution until the point of losing most of the visual characteristics of the original letter (at 25 % of original resolution) has no considerable effect on the performance metrics. Given that historical documents were produced to inconsistent standards, show strongly varying traces of age or use, and were scanned using a wide array of scanners, cameras, and recording protocols over the past decades, this robustness of the model to different resolutions is a powerful property.
Exposing the network to images with fewer amounts of applied ink (erosion) quite quickly leads to a noticeable drop in classification and OOD detection performance. A particularly interesting observation can be made for the input density which barely drops with different sizes of the structuring element, indicating that the characteristics of under-inking have significant contributions to the expectation. This is in contrast to the quickly decreasing maximum softmax score. Consequently, it can be concluded that also for this scenario an OOD score function based on the predictive part of the hybrid model is the better choice. The decrease in performance is less pronounced for broadening of the characters (dilation) where the drop only occurs after reaching unrealistic levels of ink coverage. Compared to the erosion results, the value by which the energy scores increase is significantly higher as the size of the structuring element increases. It can be concluded that over-inking plays a minor role in the present data set. At the same time, the classification performance remains stable even at moderate dilation rates.  Figure 6 further reveals that even for very aggressive JPEG compression rates both OOD detection scores and the performance metrics hold up very nicely. This is an important property since a considerable amount of scanned material is made available to the public in compressed JPEG format. Also, minor skewing of the scanning page with rotation angles of ±10 • still yields sufficient performance scores.

Qualitative analysis of class prototypes
In this section, we exploit the generative capabilities of our model for class-conditional sampling. Using the notion of probability mass (Eq. (9)), we seek to visualize image characteristics that play a significant role in optimizing the model distribution q θ (x). The generated class prototypes for different lower-and uppercase characters are shown in Fig. 7.
In general, the considered classification task works on a rather standardized type design where variance is mostly contained in smaller local details. This variance includes different amounts of ink along the centerline of the character, age-related degradations (e. g., smears, smudges), deviations between different metal types due to wear, and various types of allographs and their defining typographic properties (e. g., contraction signs like overline or curved abbreviation stroke). These characteristics are well captured across several generated images. Moreover, for characters like 't' with curved abbreviation stroke (Latin abbreviation for -tur or less often -er), we observe reoccurring visual features like a small part of the adjacent letters, indicating that the model successfully captures semantics in the margins of the glyph images.

Discussion and conclusion
Contemporary OCR frameworks are a promising approach to extracting individual glyphs from a large corpus of digital reproductions of early printed documents. However, because they are designed for high-performance transcription, their ability to accurately localize and identify the constituting glyphs is limited. Unfortunately, rule-based detection methods are equally problematic as they often fail in the presence of artifacts introduced by the physicality of the printing process, age-related wear, or typographic decorations. With this in mind, it is desirable to prune any falsely recognized glyphs and correct for potential miscategorization.
To facilitate downstream finetuning we use a joint energybased model that allows the simultaneous classification of glyphs into a set of character classes and separation between ID images and OOD samples often wrongly recognized by OCR frameworks. We show that this approach matches the classification performance of its purely discriminative counterpart and improves the OOD detection rate by a substantial margin without adding to model complexity.
One of the main ideas is to leverage prior knowledge of certain properties of OOD data that interact with our system. Using hard negative cases of rubrications, lombards, empty spaces, or very nuanced ligature crops, we can actively shape the energy surface and facilitate a more purposeful categorization of the glyphs. Fortunately, JEMs grant us extremely simple access to steer and modulate the distribution estimation by introducing a penalty margin on the energy surface that spreads apart ID and OOD regions. We see that this not only helps to optimize the energy landscape but also leads to diffuse predictive probabilities for the OOD samples across all classes. Despite empirical evidence of the advantages of this measure, we also identify some fundamental difficulties. OOD samples might share most of their visual characteristics with ID samples. In our scenario, this is especially the case for ligatures which mostly differ from the target stand-alone characters in very intricate linking patterns at the margins. From a theoretical perspective, this drastically increases the difficulty for the model to establish a sufficiently large difference in the energy scores. The model can only correct this by inducing very steep gradients into the model distribution, which might impact model stability or convergence of optimization.
A second problem occurs if different types of OOD data are penalized using the same separating energy margin. As we have shown empirically, this leads to an overall decrease in most performance metrics. Similar to the difficulties mentioned above, different OOD data with potentially contrasting characteristics are encouraged to concentrate into more narrow densities. In such scenarios, it might be viable to assign slightly different energies based on the visual similarity to the ID data.
As another valuable result, we observe that the classification performance and OOD detection rate are very robust to common image degradation. Given the partly generative nature of JEMs, the model can recover partial or corrupted information under the learned distribution. This property is especially important for the analysis of early printed material, which often contains a huge amount of artifacts. At the same time, the empirical ablation experiments are based on in-silico artifacts and, therefore, only allow a rough estimate of the actual generalizability to the many nuances in real historical prints.

Limitations
Currently, method evaluation is done on a single digital reproduction only. This might seem a manageable problem at first, as incunables comprise only a relatively small percentage of the total body of letterpress prints with moderate variance in fonts and typographic styles. However, their digital analysis must deal with a tremendous variety of scanner types, acquisition protocols, and the state of conservation which differ between libraries, editions, copies and individual reproductions. It is, therefore, indispensable for a widely usable application to either train very specialized models or to use a single model operating as generally as possible. Furthermore, the proposed fine-tuning technique at its core implements a data-driven pruning algorithm to extract only the most probable images and discard implausible ones. If the downstream task requires (near-) full coverage, this rather aggressive approach might not be appropriate. In this case, optimization efforts should concentrate on building more robust OCR models, albeit requiring additional laborious transcription work. Lastly, during method development, we observed that larger CNNs like the Wide Residual Network [72] or the Vision Transformer [15] typically converge to odd glyph shapes. Similar behavior was already described by Nijkamp et al. [54] and Du et al. [19], confirming the generally known stability problems of EBMs [20]. With this in mind, we want to point out to the reader that there are also alternative options for training EBMs that could help circumvent these issues (cf. [27]).

Future work
Apart from exploring the method's applicability to new types and compositions of data, we see strong merit in investigating different and possibly less 'invasive' techniques to integrate prior knowledge of OOD data. It would be particu-larly interesting to encourage the model to optimize toward a joint distribution of ID and OOD data without externally enforcing specific energy levels. For the application to early printed documents with often very localized variations in shape, this direction might prove beneficial. Furthermore, we want to investigate compositional patterns of specialized EBMs using different glyph concepts. As shown by Du et al. [16], we can use concept negation to learn a model distribution over data that does not include certain concepts, which might be viable for the goal of OOD detection.

Author Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Florian Kordon and Nikolaus Weichselbaumer. The first draft of the manuscript was written by Florian Kordon, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL. We acknowledge funding by the Arts and Humanities Research Council (AHRC) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)-468415227.

Availability of data and materials
The data stems from a copy of Johannes Balbus' Catholicon held by the John Rylands Research Institute and Library Incunable Collection, Manchester, UK. The digital scans are to be made publicly available soon after the internal verification procedures are completed.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.     Table 6 Overview of considered score functions s θ (x) for discriminative and hybrid OOD detection OOD detection method s θ (x) Discrim.

x)[y])
GradNorm ‡ [36] ∂ DKL(u||softmax( fθ (x)) ∂θ p KNN § [63] 1{− z * − z (k) 2 > λ th } Hybrid Pred. prob. [31] m a x y q θ (y|x) Input density [28] l o g q θ (x) Prob. mass [28] ∂ log qθ (x) ∂x 2 We refer the reader to the original articles for a detailed explanation of the respective methods. Hyper-parameter selection for ODIN, Mahalanobis, GradNorm, and KNN was done according to the minimum FPR95 on the validation set. c ,ˆ are the empirical class mean and covariance calculated on the training data, d(x) is the distance between the sample and the classcond. Gaussian as d(x) = g θ (x) −μ c , and g θ (x) is the output of the penultimate layer ‡ u is to a uniformly-valued vector, ||·|| p marks the L p norm § z (k) mark feature vectors based off the training set, z * marks the normalized feature vector of the test sample Evaluation was done on the validation splits of the OOD datasets using the AUPRC metric Best score functions for each OOD dataset are marked in bold