1 Introduction

The invention of printing with movable type is understood by many as the single most important innovation of the European Middle Ages. Since its introduction, letterpress printing has redefined the way we gather, process, and hand down information and knowledge from generation to generation. To analyze the contents of the earliest printed books, called incunables or incunabula, digital analysis techniques have become a valuable tool, taking their place alongside the bibliographical and photographic methods used by librarians and book scholars alike. Today, optical character recognition (OCR) is widely used to extract individual text components automatically or to produce complete transcriptions of digital scans [21].

While modern OCR techniques build on advances in Deep Learning and excel at transcribing printed media, their use in localizing and identifying the most elemental graphic units of text—so-called glyphs—often falls short of expectations. By design, state-of-the-art OCR analyzes lines of text to capture long-range semantic relations, making do with a rather crude localization in the process. At the same time, detecting glyphs with older rule-based algorithms [59] often fails to generalize when confronted with uncommon font groups and is surprisingly susceptible to image degradation such as smears and smudges or other pathological residuals such as painted initials, lombards, rubrication, and handwritten marginalia. Moreover, frequently occurring sequences of adjacent characters are often represented in early printing as a ligature, using a single glyph printed with a compound type. Despite subtle differences in visual appearance and general shape, the contained graphemes are mostly not separated, either intentionally or due to the OCR approach’s inability to do so. This behavior is undesirable for a downstream task that seeks to analyze and compare intricate details of glyph shape occurring in different printed editions and copies.

Fig. 1 Overview of our method. A joint energy-based model (JEM) is used for glyph classification and separation of in-distribution (ID) and out-of-distribution (OOD) data

In this paper, we will show that these issues can be mitigated by extending the OCR-based glyph extraction with a subsequent classification and out-of-distribution (OOD) detection stage (Fig. 1). Such finetuning is an attractive solution that allows us to take advantage of the powerful features of current OCR methods, such as the enormous amounts of available transcription ground truth, the availability of optimized pre-processing steps, and the ease of generalization. To do this, we use a joint energy-based model (JEM) [28] to estimate the true probability distribution of the target glyph images and classify the OCR-based image patches into a single character class. We then exploit the distribution estimate to separate in-distribution (ID) and OOD samples of ‘clutter’ images (e. g., empty spaces, rubrications, initials) or ligatures. Through a comprehensive series of experiments on a digital reproduction of an incunable, we observe strong classification and OOD detection performance, surpassing other state-of-the-art inference-time detection strategies. We further demonstrate the method’s robustness when exposed to images with different types of degradation frequently observed in scans of printed historical documents.

Our contributions to digital document analysis focus on the following areas:

  • We show that classifying glyphs with a hybrid model matches the performance of purely discriminative approaches, despite being additionally tasked with density estimation.

  • We show that enforcing a margin between ID and OOD energies during the training of JEMs substantially increases OOD detection performance for image clutter and separation between single characters and ligatures.

  • We introduce an effective extension to JEM’s generative sampling that uses information about the typical set’s geometry, i. e., regions with high contribution to the expectation, to stabilize the optimization process.

2 Related work

2.1 Detection of glyphs and symbols on historical prints

Detecting and localizing glyphs or symbols on historical prints for reproduction or fine-grained analysis is mostly coupled to the more general efforts and algorithmic developments in OCR [21]. Early approaches typically rely on geometric rule-sets [14], classical shape descriptors [48], or carefully crafted features in a machine learning approach [73] to locate and categorize recurring symbols. Subsequent methods use different variants of template matching [6, 7, 9, 25], employ graph matching [13], or follow the advances in deep learning with a particular focus on few-shot reference learning [61], object detection [8], or probabilistic generative modeling for generalized glyph shape assessment [26].

2.2 Energy-based models

The concept of energy-based models (EBM) dates back to early research on the collective properties of neural networks [35] and Boltzmann machines [1] before being set forth as a distinct approach by LeCun et al. [41]. Since then, EBMs have become an essential member of the family of deep probabilistic and generative networks [4]. They are used for implicit generative modeling [20, 54], predictive modeling [28], continual learning [44], iterative reasoning [18], compositional learning [16, 17], and OOD detection [22, 65]. They also have been used to re-interpret the workings of other approaches such as Variational Autoencoders [69] or Generative Adversarial Networks [11]. We refer the reader to the very instructive works by Du et al. [20], Song et al. [60], Osogami [55], and Arbel et al. [2].

2.3 OOD detection

OOD samples typically stem from a secondary distribution that differs from ID data by a semantic shift in the label space or a covariate shift. Yang et al. [71] provide a detailed review on OOD detection methods and their commonality with open set/open world recognition [39]. In the context of predictive modeling, popular OOD detection methods directly rely on the softmax confidence scores [31, 45], evaluate the distance of the sample to the nearest class-specific distribution [42], or utilize energy scores computed on the classifier’s logits [46]. Recent methods build upon the observation that ID data tends to show higher vector norms in gradient space compared to OOD data [36], or evaluate the k-nearest neighbor distances between feature vectors of the training set and an inference sample [63]. Furthermore, some early works explore the idea of the typical set for OOD detection [12, 50].

3 Theoretical background

In this section, we begin by introducing the general concept of EBMs and how they can be trained for modeling some probability density. After that, we will show how this notion can be used within conventional classification tasks using JEMs [28]. Following that, we will explain how these hybrid models can be used to detect OOD samples and how we can aid this goal even further by actively shaping the energy surface. Lastly, we will propose a correction for a more stable optimization of JEMs based on the concept of typical sets.

3.1 Energy-based models

Given a limited number of observations \({\textbf{x}} \in {\mathbb {R}}^{D}\), we often want to explain the underlying distribution of that data by constructing an estimate \(q_{\theta }({\textbf{x}})\) of the true probability density function \(p({\textbf{x}})\), ideally using as few assumptions about the data as possible.

Energy-based models seek to establish a relation between these probabilities and certain energy states using a parameterized energy function [41]. This relation is expressed as a Boltzmann distribution \(q_{\theta }({\textbf{x}})= \frac{\exp \left( -E_{\theta }({\textbf{x}})\right) }{Z_{\theta }}\), where \(E_{\theta }({\textbf{x}}): {\mathbb {R}}^D\mapsto {\mathbb {R}}\) is the energy function with parameters \(\theta \) which maps a point of a D-dimensional input space to a single scalar state, called the energy [37]. The partition function \(Z_{\theta } = \int _{{\textbf{x}}}\exp \left( -E_{\theta }({\textbf{x}})\right) d{\textbf{x}}\) normalizes the energy magnitudes with respect to \({\textbf{x}}\) to ensure that the fundamental properties of a probability distribution are satisfied. If the parameters \(\theta \) are configured well, realistic observations are assigned low energy values, and unlikely observations correspond to high-energy states. Intriguingly, no further restrictions are placed on the type of function chosen for \(E_{\theta }({\textbf{x}})\), which can therefore be efficiently parameterized by a neural network with weights \(\theta \).
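For illustration, a minimal sketch (assuming PyTorch; the architecture is hypothetical and not the network used later in this work) of an energy function parameterized by a small CNN that maps an image to a single scalar energy:

```python
# Minimal sketch: an energy function E_theta(x) parameterized by a small CNN
# that maps a 1 x 56 x 56 image to one scalar energy value per sample.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128, 1)

    def forward(self, x):
        # Returns E_theta(x); lower values correspond to more likely observations.
        return self.head(self.features(x)).squeeze(-1)
```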

For most scenarios, optimizing these weights with conventional maximum likelihood learning on some training data \(\mathcal {X}_{\text {train}}\) is not straightforward, as evaluating the normalization constant \(Z_{\theta }\) (which is required to assess the likelihood of \({\textbf{x}}\)) is either analytically intractable or requires time-consuming numerical integration. On the other hand, only estimating unnormalized densities by maximizing \(\exp \left( -E_{\theta }({\textbf{x}})\right) \) is not feasible since the likelihood of the training data changes with different values for \(Z_{\theta }\).

Fortunately, even though we cannot estimate the normalized densities directly, we can use Contrastive Divergence (CD) to approximate the gradients of the energy function [10, 33, 34, 67]. More formally, the partial derivative of the negative log-likelihood objective \(-\log q_{\theta }({\textbf{x}})\) with respect to weights \(\theta \) can be expressed as

$$\begin{aligned} \frac{{\partial }\mathcal {L}_{\text {ML}}(\theta ; p)}{{\partial }\theta }&= -{\mathbb {E}}_{p({\textbf{x}})} \left[ \frac{{\partial }\log q_{\theta }({\textbf{x}})}{{\partial }\theta }\right] \nonumber \\&= {\mathbb {E}}_{p({\textbf{x}})} \left[ \frac{{\partial }E_{\theta }({\textbf{x}})}{{\partial }\theta }\right] + \frac{{\partial }\log Z(\theta )}{{\partial }\theta } \nonumber \\&\approx {\mathbb {E}}_{p({\textbf{x}})} \left[ \frac{{\partial }E_{\theta }({\textbf{x}})}{{\partial }\theta }\right] - {\mathbb {E}}_{q_{\theta }({\textbf{x}})} \left[ \frac{{\partial }E_{\theta }({\textbf{x}})}{{\partial }\theta }\right] \end{aligned}$$
(1)

This contrastive update scheme increases the likelihood of samples from the true distribution \(p({\textbf{x}})\) while reducing the likelihood of samples from the generated distribution \(q_\theta (x)\), eventually reaching an optimum at the equilibrium \(p({\textbf{x}})=q_{\theta }({\textbf{x}})\).
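In code, this gradient approximation corresponds to a simple surrogate loss whose autograd gradient with respect to \(\theta \) matches Eq. (1); a minimal sketch, assuming an energy network as above and negative samples \({\textbf{x}}^{-}\) produced by the sampler described below:

```python
# Sketch of the contrastive divergence surrogate loss corresponding to Eq. (1).
# x_pos are training observations (~ p(x)); x_neg are model samples (~ q_theta(x))
# and are detached so that gradients only flow through the energy network.
def cd_loss(energy_net, x_pos, x_neg):
    e_pos = energy_net(x_pos)
    e_neg = energy_net(x_neg.detach())
    return e_pos.mean() - e_neg.mean()
```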

Now, instead of evaluating the integral \(Z(\theta )\), we approximate it numerically by taking a sample from the model distribution \(q_{\theta }({\textbf{x}})\). Although we are still unable to sample analytically from \(q_{\theta }({\textbf{x}})\), we can set up a Markov chain that stochastically explores the model distribution and eventually converges to its equilibrium/steady-state distribution using \(K\) cycles of Markov chain Monte Carlo (MCMC) sampling. This iterative transition exploits the ratio between two evaluated points \(q_{\theta }({\textbf{x}})\) and \(q_{\theta }(\mathbf {x^{\prime }})\), which allows the partition function to cancel out (cf. [68]).

To reduce the very long mixing times of the Markov chain, we can use an alternative sampler in continuous state space based on Stochastic Gradient Langevin Dynamics (SGLD) [5, 41, 53, 67, 70]:

$$\begin{aligned} {\textbf{x}}^{0}&\sim \pi ({\textbf{x}}) \nonumber \\ {\textbf{x}}^{k+1}&= {\textbf{x}}^{k} + \frac{\eta }{2} \frac{{\partial }\log q_{\theta }({\textbf{x}}^{k})}{{\partial }{\textbf{x}}^{k}} + \epsilon , \quad \epsilon \sim \mathcal {N}(0, \eta ) \nonumber \\&= {\textbf{x}}^{k} - \frac{\eta }{2} \frac{{\partial }E_{\theta }({\textbf{x}}^{k})}{{\partial }{\textbf{x}}^{k}} - \frac{\eta }{2}\underbrace{\frac{{\partial }\log Z(\theta )}{{\partial }{\textbf{x}}^{k}}}_{=0} + \epsilon \nonumber \\&= {\textbf{x}}^{k} - \frac{\eta }{2} \frac{{\partial }E_{\theta }({\textbf{x}}^{k})}{{\partial }{\textbf{x}}^{k}} + \epsilon \;, \end{aligned}$$
(2)

where \(\pi ({\textbf{x}})\) is a prior distribution from which a sample can easily be drawn (e. g., a Gaussian distribution), \(\eta \) is the step size, and \(\epsilon \) is additive Gaussian noise. This update equation refines a random initial sample by evaluating the gradients of the modeled probability density and behaves like gradient descent toward lower energy values.
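A minimal sketch of such an SGLD sampler (PyTorch assumed; the default values follow the sampler configuration reported in Sect. 4.3):

```python
import torch

def sgld_sample(energy_net, x_init, n_steps=60, eta=20.0, noise_std=0.005):
    # Iteratively refine an initial sample by following the negative energy
    # gradient with additive Gaussian noise, cf. Eq. (2).
    x = x_init.detach().clone()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy_net(x).sum(), x)[0]
        x = x.detach() - 0.5 * eta * grad + noise_std * torch.randn_like(x)
    return x.detach()
```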

3.2 Classification with joint energy-based models

We recall that EBMs estimate the probability density \(q_{\theta }({\textbf{x}})\) over the observation space with a high likelihood for data that are part of \(\mathcal {X}_{\text {train}}\). In contrast, predictive classification models want to estimate the conditional distribution \(q_{\theta }(y\vert {\textbf{x}})\) for data points \({\textbf{x}}\in {\mathbb {R}}^{D}\) and labels \(y\in \{1,\ldots ,C\}\) over \(C\) distinct classes. In practice, this distribution is most often estimated using a parametric mapping \(f_{\theta }({\textbf{x}}): {\mathbb {R}}^{D}\mapsto {\mathbb {R}}^{C}\) that assigns real numbers (logits) for all classes to each data point. The logits are subsequently normalized to obtain the categorical distribution as

$$\begin{aligned} q_{\theta }(y\vert {\textbf{x}}) = \frac{\exp (f_{\theta }({\textbf{x}})[y])}{\sum _{y^{\prime }}\exp (f_{\theta }({\textbf{x}})[y^{\prime }])} \end{aligned}$$
(3)

where \(f_{\theta }({\textbf{x}})[y]\) is the \(y^{\text {th}}\) logit.

As shown by Grathwohl et al. [28], both these concepts can be elegantly combined by interpreting the classifier’s logits as unnormalized densities of the joint distribution \(q_{\theta }({\textbf{x}}, y) = \exp (f_{\theta }({\textbf{x}})[y])/Z(\theta )\). In other words, we can use the logits of a discriminative classifier to define an EBM using the relation \(E_{\theta }({\textbf{x}}, y) = -f_{\theta }({\textbf{x}})[y]\).

The marginal distribution over \({\textbf{x}}\) then defines a density model over the data \({\textbf{x}}\) as

$$\begin{aligned} q_{\theta }({\textbf{x}}) = \sum _{y}{q_{\theta }({\textbf{x}}, y)}=\frac{ \sum _{y}\exp (f_{\theta }({\textbf{x}})[y])}{Z(\theta )}. \end{aligned}$$
(4)

Training \(f_{\theta }\) with respect to the joint distribution can be efficiently done using the factorization \(\log q_{\theta }({\textbf{x}},y) = \log q_{\theta }({\textbf{x}}) + \log q_{\theta }(y\vert {\textbf{x}})\) [28, 56]. This allows us to optimize the categorical distribution \(q_{\theta }(y\vert {\textbf{x}})\) using the cross-entropy objective function and to optimize the densities \(q_{\theta }({\textbf{x}})\) with CD. The resulting hybrid model (cf. [49]) with generative and predictive components is known as a Joint Energy-based Model (JEM) [28].

Within SGLD, we can then draw samples from \(q_{\theta }\) in two ways: (1) Computing gradients with respect to \(\log \sum _{y}{\exp (f_{\theta }({\textbf{x}})[y])}\), which corresponds to unconditional sampling. (2) Computing gradients with respect to \(\log \exp (f_{\theta }({\textbf{x}})[y])\), which allows class-conditional sampling.
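For illustration, both energy definitions can be computed directly from the classifier’s logits; a minimal sketch (PyTorch assumed):

```python
import torch

def unconditional_energy(logits):
    # E_theta(x) = -log sum_y exp(f_theta(x)[y]); its gradients w.r.t. x drive
    # unconditional SGLD sampling.
    return -torch.logsumexp(logits, dim=1)

def conditional_energy(logits, y):
    # E_theta(x, y) = -f_theta(x)[y]; used for class-conditional sampling.
    return -logits.gather(1, y.unsqueeze(1)).squeeze(1)
```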

3.3 Incorporating out-of-distribution data with energy margins

One characteristic trait of EBMs is their ability to assign substantially different energy states to ID data and data suffering from semantic or covariate shift, more generally called OOD data. Thus, EBMs are implicitly competitive classifiers for distinguishing between ID and OOD data.

Critically, however, optimization with CD does not enforce steady energy states for data sampled from \(p({\textbf{x}})\) and \(q_{\theta }({\textbf{x}})\) as it is unaffected by any additive bias on the energy function’s output. The energy magnitudes fluctuate across the CD iterations and might drift into numerically unstable ranges, which negatively affects the contrastive shape of the energy surface by introducing steep gradients and making it increasingly non-smooth. This process leads to overall worse approximations to the expectation of the energy w. r. t. the model distribution, such that finding an appropriate discrimination threshold for OOD detection becomes very difficult.

An intuitive and simple concept to counteract this behavior is to establish explicit energy levels for \(p({\textbf{x}})\) and \(q_{\theta }({\textbf{x}})\) during CD. Based on the idea of Du et al. [20], we add \(L2\) regularization to constrain the Lipschitz constant, i. e., the in-distribution loss becomes:

$$\begin{aligned} \mathcal {L}_{\text {ID}}(\theta )=E_{\theta }({\textbf{x}}^{+})^{2} + E_{\theta }({\textbf{x}}^{-})^{2}, \end{aligned}$$
(5)

with \({\textbf{x}}^{+}\sim p({\textbf{x}})\) and \({\textbf{x}}^{-}\sim q_{\theta }({\textbf{x}})\). This regularization term ultimately enforces an energy around \(E_{\theta }({\textbf{x}}^{+})\approx 0\). Because of the remaining approximation bias in \(q_{\theta }({\textbf{x}})\), the energy for \({\textbf{x}}^{-}\) typically ends up slightly higher.

In addition, we are often interested in specific scenarios where we know well in advance certain characteristics of OOD data that might interact with our system. This information is particularly valuable when the characteristics relevant for distinguishing between ID and OOD data differ only by nuances. We can assist the contrastive optimization scheme by explicitly introducing known (and task-specific) samples for which we enforce a high energy level.

Fig. 2 Illustration of enforcing an energy margin between ID and OOD data. Figure adapted from Liu et al. [46]

Let us consider a secondary dataset \(\mathcal {X}_{\text {OOD}}\) with known OOD examples \({\textbf{x}}^{*}\in {\mathbb {R}}^{D}\). For each CD iteration, we can draw a random sample from this dataset and penalize the assigned energy value \(E_{\theta }({\textbf{x}}^{*})\) if it is below a certain margin \(\tau \in {\mathbb {R}}\). We build here on the idea of Liu et al. [46] and implement this penalty term using a squared Hinge loss as

$$\begin{aligned} \mathcal {L}_{\text {OOD}}(\theta ) = \max (0, \tau -E_{\theta }({\textbf{x}}^{*}))^{2}. \end{aligned}$$
(6)

The effect of this energy regularization for OOD data is illustrated in Fig. 2.
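Both regularizers translate into a few lines of code; a minimal sketch (PyTorch assumed; the margin value and the mean reduction over the batch are illustrative choices):

```python
import torch

def id_energy_regularizer(e_pos, e_neg):
    # L2 penalty of Eq. (5) on the energies of x+ ~ p(x) and x- ~ q_theta(x),
    # keeping the ID energies close to zero.
    return (e_pos ** 2 + e_neg ** 2).mean()

def ood_margin_loss(e_ood, tau=3.0):
    # Squared hinge of Eq. (6): penalize known OOD samples whose energy falls
    # below the margin tau.
    return torch.clamp(tau - e_ood, min=0.0).pow(2).mean()
```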

3.4 Introducing probability mass to SGLD sampling

A defining concept in CD is to numerically approximate the expectation of the target probability density using an MCMC estimator. To find a good estimate in a computationally efficient manner, we must make sure to focus on regions in the sample space that contribute significantly to the expectation [3]. An intuitive strategy for sampling the model distribution is to explore regions around the mode where we observe high density. This objective agrees well with the concept of maximum likelihood models but fails to take into account the geometric composition of high-dimensional sample spaces. Specifically, regions of high probability density lie in a compact neighborhood around the mode but constitute only a small volume. On the other hand, regions of high volume but low density are located toward the tails of the distribution [3]. This effect becomes more pronounced with increasing dimensionality, such that explorations of the sample space that are based on high-density sampling inevitably ignore important parts of the contributions to the expectation. Note that the strength of this contribution at any point is determined by the product of density \(q_{\theta }({\textbf{x}})\) and volume \(d{\textbf{x}}\), known as the probability mass. Consequently, in order to locate the areas with the most significant contributions to the expectation, we have to carefully consider both the density and the volume. The region where the highest mass concentrates is located between the two extremes and is generally known as the typical set [3, 12, 50].

This notion of a typical set plays a decisive role when modeling Markov chains. For a well-structured target distribution that is preserved under some Markov transitions, the Markov chain first navigates toward the typical set and then explores it further to reduce the strong initial bias until consolidation. Unfortunately, this process may be disrupted as the Markov transitions are limited in their ability to effectively probe certain types of regions, e. g., high-curvature non-smooth neighborhoods which are very common in practice. For short mixing times, this leads to heavily biased Monte Carlo estimators. With longer mixing times, the Markov chain can sporadically explore these regions, which, however, leads to substantial oscillations in the expectation estimate. As a result, running a finite Markov chain in the presence of such pathological areas most likely leads to biased and sub-optimal estimators which never reach the true distribution.

Fig. 3 Simplified illustration of MCMC with an SGLD sampler. Under ideal conditions, the Markov chain will eventually converge to areas of high probability mass, the typical set, and then explore them further. By querying information about the structure of the typical set at each MCMC cycle, we can inform the Markov transitions about areas with locally high contributions to the expectation. For finite Markov chains, using only information about the probability density leads to slower mixing and thus may result in more biased samplers

It can therefore be advantageous to calibrate the Markov transitions to the geometry of the typical set by querying the target distribution. We propose a lightweight and efficient implementation of this idea by introducing a correction term to the SGLD update equation [Eq. (2)]. This correction harnesses information about the typical set’s geometry at each update step to prioritize Markov states that maintain a certain level of probability mass (Fig. 3).

As a notion for the probability mass at a point, \(q_{\theta }({\textbf{x}}) d{\textbf{x}}\), we use a differential approximation presented by Grathwohl et al. [28] based on gradients of the log-density/energy function. Under this approximation, the Euclidean norm of these gradients should be small in areas of high mass. Formally, this relation can be written as follows:

$$\begin{aligned} q_{\theta }({\textbf{x}}) d{\textbf{x}}\, \propto \, {m_{\theta }({\textbf{x}})}^{-1} \end{aligned}$$
(7)

where

$$\begin{aligned} m_{\theta }({\textbf{x}}) = \left| \left| \frac{{\partial }\log q_{\theta }({\textbf{x}})}{{\partial }{\textbf{x}}}\right| \right| _{2} = \left| \left| \frac{{\partial }E_{\theta }({\textbf{x}})}{{\partial }{\textbf{x}}}\right| \right| _{2}. \end{aligned}$$
(8)

This correction term can then be used during SGLD update to steer the Markov transitions along a trajectory with sufficient probability mass toward the typical set:

$$\begin{aligned}&{\textbf{x}}^{0} \sim \pi ({\textbf{x}}) \nonumber \\&{\textbf{x}}^{k+1} = {\textbf{x}}^{k} - \frac{\eta }{2}\left( \gamma \frac{{\partial }E_{\theta }({\textbf{x}}^{k})}{{\partial }{\textbf{x}}^{k}} + (1-\gamma ) m_{\theta }({\textbf{x}}^{k})\right) + \epsilon \end{aligned}$$
(9)

\(\gamma \in [0,1]\) is a relative weighting factor that can be tuned to find a balance between high-density and high-mass sampling. In the following, we call this concept mass-corrected sampling (MCS).
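A minimal sketch of one mass-corrected SGLD step, reading Eq. (9) literally: the per-sample gradient norm \(m_{\theta }({\textbf{x}})\) is a scalar that is broadcast over all input dimensions, and the hyper-parameter defaults are placeholders:

```python
import torch

def mcs_step(energy_net, x, eta=20.0, gamma=0.5, noise_std=0.005):
    # One mass-corrected SGLD transition following Eq. (9).
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy_net(x).sum(), x)[0]   # dE/dx
    mass = grad.flatten(1).norm(p=2, dim=1)                 # m_theta(x): one scalar per sample
    mass = mass.view(-1, *([1] * (x.dim() - 1)))            # broadcast to input shape
    update = gamma * grad + (1.0 - gamma) * mass
    return (x - 0.5 * eta * update + noise_std * torch.randn_like(x)).detach()
```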

3.5 Final objective function

Putting it all together, we arrive at the following objective for JEM training:

$$\begin{aligned} \mathcal {L}_{\text {JEM}} = \mathcal {L}_{\text {ML}} + \alpha \mathcal {L}_{\text {CLS}} + \beta \mathcal {L}_{\text {ID}} + \lambda \mathcal {L}_{\text {OOD}}. \end{aligned}$$
(10)

\(\mathcal {L}_{\text {ML}}\) marks the CD policy [Eq. (1)] to optimize the densities \(q_{\theta }({\textbf{x}})\) with the standard biased SGLD sampler according to Eq. (2), or the mass-corrected sampler according to Eq. (9). \(\mathcal {L}_{\text {CLS}}\) with weight \(\alpha \in {\mathbb {R}}\) marks a cross-entropy objective used to optimize the predictive component \(q_{\theta }(y\vert {\textbf{x}})\). \(\mathcal {L}_{\text {ID}}\) with weight \(\beta \in {\mathbb {R}}\) corresponds to the L2 regularization given in Eq. (5), and \(\mathcal {L}_{\text {OOD}}\) with weight \(\lambda \in {\mathbb {R}}\) is the energy margin penalization for OOD data given by Eq. (6). Unless stated otherwise, we train the models using Eq. (10) with weights \(\alpha =1\), \(\beta =0.1\), \(\lambda =0\), and the default SGLD sampler (i. e., Eq. (9) with \(\gamma =1\), which reduces to Eq. (2)).
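For illustration, a minimal sketch of the combined objective (PyTorch assumed; the individual terms correspond to the sketches in the previous subsections and the weights follow the defaults above):

```python
import torch
import torch.nn.functional as F

def jem_objective(logits_pos, e_pos, e_neg, e_ood, labels,
                  alpha=1.0, beta=0.1, lam=0.0, tau=3.0):
    l_ml = e_pos.mean() - e_neg.mean()                        # CD surrogate, Eq. (1)
    l_cls = F.cross_entropy(logits_pos, labels)               # predictive component
    l_id = (e_pos ** 2 + e_neg ** 2).mean()                   # L2 energy penalty, Eq. (5)
    l_ood = torch.clamp(tau - e_ood, min=0.0).pow(2).mean()   # OOD energy margin, Eq. (6)
    return l_ml + alpha * l_cls + beta * l_id + lam * l_ood
```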

4 Application to glyph classification

In this section, we describe the data to which we want to apply the algorithms presented in the previous section. First, we specify the selection and preparation of glyphs from an incunable edition. We follow that by describing two OOD datasets that cover frequently observed pathological residuals in incunables which are often falsely detected by OCR frameworks. Finally, we provide details on data pre-processing, model implementation, and the metrics we use for method evaluation.

4.1 Datasets

4.1.1 Incunable glyph data

In order to obtain a large corpus of incunable glyphs, we followed a two-step process. First, we extracted possible candidates and an estimate of their associated class using an OCR engine trained on a large corpus of synthetic text lines with Latin characters [64]. This step was followed by careful manual selection and curation for each class by a domain expert.

In particular, we set up a pre-processing and glyph extraction pipeline with OCR-D [51] which included the following operations in sequential order: (1) binarization, (2) cropping, (3) denoising, (4) region segmentation, (5) combination of rule-based (glyph-level) and LSTM-based OCR (line-level) with a language model that was pre-trained on synthetic Latin prints in mostly Antiqua and Fraktur fonts, and (6) glyph extraction based on position decoding of the Connectionist Temporal Classification (CTC) output (the detailed workflow is given in Table 4).

All glyph candidates were then categorized based on the most probable grapheme as identified by the OCR engine.

We used this pipeline on high-resolution scans of the first 71 pages of a digital reproduction of Johannes Balbus’ Catholicon from the copy held by the John Rylands Research Institute and Library Incunable Collection (shelfmark: 3383) (cf. ISTC ib00020000 [38], GW03182 [24]). This edition of the Catholicon was presumably printed in 1460 using a Gotico–Antiqua type. The number of extracted candidates over all characters defined in the language model was 345,152.

After extraction, we defined 42 classes covering the most prevalent lower- and uppercase characters according to a manually curated glyph table [66]. For each class, all described allographs of the character apart from typographic ligatures were considered. Following that, up to 1,000 random samples from each class (if applicable) were selected before being reviewed and pruned by an expert in typographic printing and historical book studies. The cleaned corpus contains 16,171 images distributed across all classes, split into train/validation/test subsets according to a 70 %/15 %/15 % ratio (11,319/2,426/2,426 images).

4.1.2 Out-of-distribution data

We optimize and evaluate the OOD detection performance on three auxiliary datasets. Each set contains samples with selected characteristics that lead to an undesirable but plausible shift in the true distribution \(p({\textbf{x}})\). The selection of these characteristics was based on the expertise of book scholars, their analyses of OCR extraction results, and their definition of undesirable patterns in subsequent reproduction.

  1. Clutter: We identify frequent false-positive candidates from the OCR extraction pipeline and combine them in a single OOD dataset. The selection covers images with rubrications, illuminations, lombards, hand-written notes and initials, empty spaces, and images that frame multiple text lines or multiple adjacent characters. These properties can be attributed to a different label space, thus constituting a semantic shift. For all 2,802 samples, we follow the same data split as for the ID glyph dataset and use randomly selected train/validation/test subsets with a 70 %/15 %/15 % ratio (1,961/420/421 images).

  2. Ligatures: Images of typographic ligatures which are centered on one of the constituting graphemes can share great visual similarity with the stand-alone characters that we consider as part of the 42 classes. We build upon this covariate shift and construct an OOD dataset that incorporates 614 such off-centered ligatures. Similarly, we use a 70 %/15 %/15 % split with 429/92/93 randomly assigned images.

  3. Mixed: We combine all images from Clutter and Ligatures while maintaining their splits.

4.2 Data pre-processing

All images were converted from RGB to grayscale and padded with zeros to the largest edge lengths in height and width present in the incunable glyph data. The smaller image dimension was then resized to \(56\,\textrm{px}\), with the larger dimension scaled to preserve the original aspect ratio. Lastly, we center-cropped the images to a uniform spatial resolution of \(56 \times 56\,\textrm{px}\), facilitating mini-batch training. Importantly, these steps maintain the relative scale between the image samples.
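A minimal sketch of this pre-processing chain, assuming PIL images and torchvision transforms; max_h and max_w stand for the largest edge lengths found in the glyph corpus and are placeholders here:

```python
from PIL import ImageOps
from torchvision import transforms

def preprocess(img, max_h, max_w, out_size=56):
    img = img.convert("L")                                    # RGB -> grayscale
    pad_w, pad_h = max_w - img.width, max_h - img.height
    img = ImageOps.expand(img,                                # zero-pad to corpus maximum
                          border=(pad_w // 2, pad_h // 2,
                                  pad_w - pad_w // 2, pad_h - pad_h // 2),
                          fill=0)
    pipeline = transforms.Compose([
        transforms.Resize(out_size),                          # smaller edge -> 56 px, keep aspect
        transforms.CenterCrop(out_size),                      # uniform 56 x 56 patches
        transforms.ToTensor(),
    ])
    return pipeline(img)
```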

4.3 Neural network architecture and training protocol

We use a lightweight Convolutional Neural Network (CNN) for all experiments and ablation studies. The network topology is provided in Appendix Table 5.

We train for a maximum of 120 epochs using an Adam optimizer with an initial learning rate of 0.0001 (decayed exponentially by a factor of 0.97 every second epoch) and a batch size of 32 across all datasets. We further configure the biased SGLD sampler with \(K=60\) MCMC cycles, a large learning rate \(\eta =20\), and a reduced amount of injected noise with \(\sigma =0.005\), promising faster convergence in scenarios with few cycles \(K\), albeit straining the balance between the sample variance during stochastic gradient descent (early mixing) and the variance during posterior sampling via SGLD (late mixing) [67].

In addition, we use the concept of reservoir sampling/replay buffer [19, 20] to reduce the Markov Chain mixing times by using past samples to initialize SGLD sampling. For each mini-batch, we use 95 % buffer images and 5 % new Gaussian noise samples.
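A minimal sketch of such a replay buffer (the capacity is a placeholder; the 5 % re-initialization fraction follows the text above):

```python
import torch

class ReplayBuffer:
    def __init__(self, capacity=10000, img_shape=(1, 56, 56)):
        # Buffer of past SGLD samples, initialized with Gaussian noise.
        self.buffer = torch.randn(capacity, *img_shape)

    def sample_init(self, batch_size, reinit_frac=0.05):
        # Draw SGLD starting points: mostly past samples, 5 % fresh noise.
        idx = torch.randint(0, self.buffer.shape[0], (batch_size,))
        x = self.buffer[idx].clone()
        mask = torch.rand(batch_size) < reinit_frac
        x[mask] = torch.randn_like(x[mask])
        return x, idx

    def update(self, x_new, idx):
        # Write the final SGLD states back into the buffer.
        self.buffer[idx] = x_new.detach()
```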

4.4 Evaluation metrics

Classification    We evaluate the classification performance using two threshold-independent metrics: (1) we compute the area under the precision-recall curve (AUPRC) using a one-vs-rest approach, and (2) we evaluate the area under the receiver operating characteristic curve (AUROC). We also report the expected calibration error (ECE) [29] to measure the alignment between the predicted softmax probabilities and the classification accuracy.

OOD detection    We evaluate the OOD detection performance by interpreting it as a binary classification problem between ID data (true class with label 1) and OOD data (false class with label 0). We report the AUPRC and AUROC (trapezoidal rule), as well as the false-positive rate at a true-positive rate of 95 % (FPR95).
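For illustration, a minimal sketch of the OOD detection metrics, assuming scikit-learn; ID samples are labeled 1, OOD samples 0, scores are higher for ID data, and average precision serves as the AUPRC estimate:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def ood_metrics(labels, scores):
    auprc = average_precision_score(labels, scores)
    auroc = roc_auc_score(labels, scores)          # trapezoidal rule
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = float(fpr[np.argmax(tpr >= 0.95)])     # FPR at 95 % TPR
    return auprc, auroc, fpr95
```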

5 Experimental results

In this section, we evaluate JEMs and the presented extensions on the glyph dataset. We start by comparing its classification performance with that of a purely discriminative approach. Then, we evaluate the model’s ability to detect OOD samples and examine if learning a margin between ID and OOD data in the energy spectrum can improve this performance. Furthermore, we analyze the correction term for SGLD concerning the typical set and study its impact on training stability and general performance. Finally, we assess the algorithm’s susceptibility to several types of image degradation and inspect generative samples of the typical set for a subset of the classes.

5.1 Glyph classification with JEMs

We start by comparing the classification performance between a discriminative baseline model (CNN) and unconditional and class-conditional JEMs in default configuration (cf. Sect. 3.5). The results are summarized in Table 1.

The discriminative model (CNN) shows a strong classification performance with an AUPRC of 0.972 and excellent confidence calibration. Despite being additionally tasked with density estimation, the JEM with unconditional sampling even improves upon this predictive model with a minor increase in AUPRC, albeit at a higher calibration error. As can be observed in Fig. 4, the most common characteristics among the wrongly assigned images include margin crops, lack of centering, label noise (ligatures, rubrications), or partial under-inking which leads to unfavorable similarity to other classes. Moreover, underrepresented classes at the long tail suffer from diffuse predictive probabilities, so that assignments to similar classes are very likely (e. g., ‘k’ \(\rightarrow \) ‘h,’ ‘G’ \(\rightarrow \) ‘E’).

Table 1 Comparison of classification results between a predictive model and unconditional and class-conditional JEMs (hybrid)
Fig. 4 Examples of falsely assigned image samples by the unconditional JEM model. The most common characteristics include margin crops, missing centrality, and label noise (e. g., ligatures and annotation errors)

Interestingly, class-conditional sampling (JEM-CC) leads to noticeably worse results with a difference of one percentage point to the unconditional variant (Table 1). By design, class-conditional sampling forces the Markov chain to concentrate on a distinct, narrower area where the probability mass for the specific target class is high. Given that we start the Markov chain with a random initial point in the sample space, this region may be distant or may require a trajectory across or out of regions with high contribution to the expectation of visually very similar classes. Although this confounding effect is heavily reduced by using a class-specific sampling reservoir (as we initialize the sampling at nearby points), it could be further alleviated by increasing the Markov chain mixing time to compensate for the longer or more complex trajectories.

5.2 OOD detection of clutter and ligatures

In this section, we compare the OOD detection performance of the hybrid JEM models with different discriminative inference-time detection methods. In particular, we evaluate different choices of the score function \(s_{\theta }({\textbf{x}})\in {\mathbb {R}}\) for ID and OOD samples and evaluate the separability of the two resulting score distributions (cf. Appendix Table 6).

We consider three OOD detection strategies that make use of the classifiers’ softmax scores. The baseline method by Hendrycks et al. [31] uses the magnitude of the classifier’s highest softmax score, which we refer to as predictive probability. Intuitively, ID images should be assigned a single high class score close to one, whereas OOD samples are expected to yield a rather diffuse probability, ideally approximating a uniform probability over all classes. In ODIN [45], the authors build on this idea and use temperature scaling and input perturbations based on the image gradients to increase the calibration of the prediction confidence (cf. [29]), allowing the predictive probabilities of ID and OOD image samples to be pushed further apart. While ODIN exploits the gradient domain only implicitly to calculate perturbations, GradNorm by Huang et al. [36] directly derives scores from the gradient space by calculating the vector norm of the gradients backpropagated from the Kullback–Leibler divergence between the softmax output and a uniform distribution. We observed that the scores obtained with this approach were significantly higher for ID data than for OOD data. Moreover, we evaluate score functions based on the free energy score across all class logits [46], as well as on the Mahalanobis distance of a sample to the nearest class-conditional Gaussian distribution using the notion of an induced generative classifier under Gaussian discriminant analysis (GDA) [42]. This strong assumption on the class-conditional distribution in the feature space was pointed out by Sun et al. [63] as a potential weakness. They proposed a non-parametric estimate based on k-th nearest neighbor (KNN) distances between the trained ID embeddings and the embedding of the input sample. For the hybrid models, we again use the predictive probability, which performed better than energy-based scores based on probability density and mass during an initial analysis on the OOD datasets’ validation splits (cf. Appendix Table 7).
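For illustration, minimal sketches of three of these score functions \(s_{\theta }({\textbf{x}})\): the maximum softmax probability [31], the free energy score [46], and the KNN distance score [63] (simplified re-implementations for this overview; higher scores indicate ID data):

```python
import torch
import torch.nn.functional as F

def msp_score(logits):
    # Maximum softmax probability (predictive probability baseline).
    return F.softmax(logits, dim=1).max(dim=1).values

def energy_score(logits, T=1.0):
    # Negative free energy over the logits; higher means "more ID".
    return T * torch.logsumexp(logits / T, dim=1)

def knn_score(feat, train_feats, k=50):
    # Negative distance to the k-th nearest L2-normalized training embedding.
    feat = F.normalize(feat, dim=1)
    train_feats = F.normalize(train_feats, dim=1)
    dists = torch.cdist(feat, train_feats)                  # (B, N_train)
    kth = dists.topk(k, dim=1, largest=False).values[:, -1]
    return -kth
```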

Table 2 Comparison of OOD detection scores on the test splits of Clutter and Ligatures datasets

The results in Table 2 reveal that most OOD detection methods reliably separate the clutter images from ID data. Surprisingly, applying GradNorm leads to a much worse separation of the two distributions, suggesting that the differences in gradient magnitude are not sufficiently discriminative. We hypothesize that the inter-class differences in the glyph data are smaller than those in the ID datasets considered in the original work, which might lead to less informative gradient norms. The KNN method yields the best results by a large margin. In comparison, all discriminative methods yield noticeably worse detection rates for the OOD ligature data. Interestingly, an additional calibration of the prediction confidence does not aid OOD detection. For example, ODIN utilizes perturbations of the input to increase the maximum softmax score of the ID samples. As we already observe exceptionally high scores for the ID data at well-calibrated confidence (cf. Table 1), the effect is seemingly negligible. It should also be noted that neural networks in many scenarios tend to produce overconfident predictions [30, 32, 43, 47, 52]. These scores can also be disproportionately high for OOD data, even if the data is far away from the softmax decision boundary [42], which curtails the effectiveness of ODIN further.

Using the Mahalanobis distance score significantly improves the ligature detection rate in terms of AUPRC and AUROC scores compared to the predictive probability baseline. At the same time, the FPR at a fixed threshold of 95 % TPR is shown to be worse than the predictive probability baseline, highlighting the importance of adequate threshold selection. Because the class-conditional distributions relevant for calculating the distance-based scores are derived using features from different receptive fields, we argue that both local details and global context are important to capture the intricate differences between stand-alone and linked allographs. For example, ligatures starting with the character ‘f’ might connect to the adjacent character by a slight extension of the ascender (e. g., ‘l’), the crossbar (e. g., ‘o’), or might only show an infiltration into the baseline region without a continuous ink trace.

Similar to the observations on the clutter data, the KNN-based OOD detection yields very competitive AUROC and FPR95 scores that substantially improve over the predictive probability baseline. However, while this tells us that KNN produces very few false positives (ID samples assigned to the OOD distribution) given a high TPR, the low AUPRC indicates that we still obtain a relatively large absolute count of false positives. Furthermore, KNN explicitly uses only feature embeddings from the penultimate layer to compute the nearest neighbor distances. For that reason, and based on the observations for the Mahalanobis distance score, it would be interesting to investigate whether a weighted addition of earlier network layers with a stronger local context affects detection quality.

The unconditional JEM without any further modifications matches the OOD detection rate of the best-performing discriminative models for ligatures and clutter images. In comparison, continuing the trend observed in the classification results, the class-conditional JEM achieves worse results than the unconditional variant, which is particularly evident in the poorer detection of ligatures. The four percentage point difference in AUPRC performance suggests that it is useful to capture overarching features across all classes, such as frequently occurring linking patterns between adjacent characters.

Table 3 Effect of explicit energy margins (EM) and mass-corrected sampling (MCS) with \(\gamma =0.5\) on classification and OOD detection performance

5.3 Effects of an energy margin between ID and OOD data

In this section, we introduce energy margins between ID and OOD data with an additional Hinge loss term (cf. Sect. 3.3) and evaluate its influence on classification and OOD detection.

We begin by exploring the hyper-parameter space for different combinations of the energy margin \(\tau \in \{3,6,9\}\) and the multiplicative weight \(\lambda \in \{0.01, 0.03, 0.1\}\). Because we are interested in high classification accuracy and high OOD detection performance at the same time, we select the hyper-parameters according to the highest harmonic mean of classification AUPRC and OOD AUPRC for clutter and ligatures on the validation sets (Appendix Fig. 8). The best-performing configurations were found to be \((\tau =3;\,\lambda =0.03)\) for JEM and \((\tau =3;\,\lambda =0.01)\) for JEM-CC. We observed that the class-conditional variant suffers from a diverging CD training loss with eventual breakout for almost all runs with higher margins or weights. While this phenomenon affects the classification abilities only marginally, it impairs the OOD detection on both OOD datasets and leads to unstable training objectives.

We now apply the models with a tuned energy margin on the test set (Table 3). The explicit separation of clutter images yields only minor improvements to the OOD detection scores as these images are already assigned a sufficiently large energy value by default. This can be explained by a large difference in observed characteristics between members of the target classes and the considered types of clutter. Conversely, enforcing a margin between ID data and ligature images shows over 2 percentage points improvement in OOD detection AUPRC and reduces the FPR95 by 17.2%. This detection performance is also notably better than that of the KNN method, especially in the absolute number of false positive assignments to the OOD distribution. Here, providing the network with hard OOD samples to shape the energy surface in between ID and OOD data distributions helps to pinpoint intricate visual differences despite the strong semantic similarity. For both unconditional and class-conditional modeling, the classification performance slightly drops when certain energy states are enforced.

A particularly interesting observation can be made for the models that are trained on the combined OOD data pool. Although the OOD detection rates can be improved for both clutter and ligatures, the magnitude of the improvement falls short of the specialized training variants. This effect suggests that different classes of OOD data should not be assigned to the same region of the energy spectrum but rather be treated separately. In this context, future work could investigate treating different OOD data as additional classes during training of the predictive component.

5.4 Effects of using probability mass in SGLD

We continue by introducing a notion of probability mass to the SGLD update equation (i. e., mass-corrected sampling) and analyze its effects on training stability and sampling quality (cf. Sect. 3.4). The results are summarized in Table 3.

Fig. 5 Contrastive divergence training loss curves illustrating the mediating effect of mass-corrected sampling (MCS) for class-conditional JEMs. Larger OOD energy margins \(\tau \) or weights \(\lambda \) require increasingly sharp energy gradients for class separation, which can lead to unstable gradient-based optimization and irrecoverable estimator bias

In general, MCS shows no effect when it is used to train unconditional models. For class-conditional sampling, however, MCS leads to a significant improvement in discriminating between ID images and ligatures with 4.7 percentage points in AUPRC. This trend is reinforced when MCS is used together with an explicit ID/OOD energy margin, albeit at the cost of a noticeable decrease in classification accuracy. From a conceptual perspective, training with a class-conditional sampler requires a comparably sharper separation of class-specific densities in sample space. This effect makes it difficult to fulfill the contrastive update scheme, which builds on the idea of increasing the likelihood of a sample given a specific class while at the same time pulling down the likelihood of samples from the model distribution and known OOD samples. We argue that in such a scenario, MCS helps to stabilize the SGLD update trajectory in more pathological regions by favoring Markov transitions that do not stray too far from the typical set [3].

The visualization of the CD training loss in Fig. 5 shows that this stabilizing effect prevents loss breakouts and the subsequent divergence of the optimization process. This is particularly true for more aggressive hyper-parameter selections with large energy margins \(\tau \) and weighting factors \(\lambda \).

5.5 Effects of image degradations

This section examines the hybrid model’s behavior when exposed to images affected by different degradations frequently observed on historical printed documents. Specifically, we are interested in the impact of these effects on classification and OOD detection performance. We consider six different image augmentations which are applied and evaluated in an isolated manner (sketches of selected augmentations are given after the list below).

  1. Additive Gaussian noise with \(\sigma \in [0,2]\).

  2. Bilinear image interpolation to simulate varying scanning resolution with interpolation factor \(s\in [0.1,1]\).

  3.–4. Grayscale erosion/dilation to simulate under- and over-inking effects during the printing process with a square structuring element of width \(w\in \{1,2,\ldots ,10\}\).

  5. JPEG compression with quality levels \(Q\in [1,100]\).

  6. Minor image rotation simulating image skews with rotation angle \(r\in [-30,30]^\circ \).
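Minimal sketches of three of these degradations, assuming grayscale images given as NumPy arrays in \([0,1]\); the parameter defaults are placeholders within the ranges above:

```python
import io
import numpy as np
from PIL import Image
from scipy.ndimage import grey_erosion

def add_gaussian_noise(img, sigma=0.75):
    # Additive Gaussian noise with standard deviation sigma.
    return img + np.random.normal(0.0, sigma, img.shape)

def grayscale_erosion(img, width=3):
    # Grayscale erosion with a square structuring element of the given width
    # (scipy.ndimage.grey_dilation mirrors it for the dilation case).
    return grey_erosion(img, size=(width, width))

def jpeg_compress(img, quality=25):
    # Round-trip the image through an in-memory JPEG at the given quality level.
    buf = io.BytesIO()
    Image.fromarray((img * 255).astype(np.uint8)).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf)).astype(np.float32) / 255.0
```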

Fig. 6 Impact of image augmentations on OOD scores and classification performance on all test images. Each panel shows the effects of an isolated augmentation on mean predictive probability and mean energy score (upper axis), and classification and OOD detection (lower axis). a Additive Gaussian noise. b Bilinear image interpolation to simulate varying scanning resolution. c, d Erosion and dilation to mimic under- and over-inking during printing. e JPEG compression. f Minor image rotation. OOD NoAug (lower axis) refers to the model’s ability to differentiate the augmented and non-augmented images (evaluation done with \(\text {JEM}\,+\,\text {MCS}\,+\,\text {EM}_{\text {Ligatures}}\))

The results for each augmentation type are visualized in Fig. 6. With increasing additive Gaussian noise up to a standard deviation of 0.75 px, the classification performance and OOD detection rate remain almost constant. For more aggressive noise configurations, the augmented images increasingly resemble the initial Markov states in SGLD sampling which correspond to very high energies. Furthermore, decreasing the image resolution until the point of losing most of the visual characteristics of the original letter (at 25 % of original resolution) has no considerable effect on the performance metrics. Given that historical documents were produced to inconsistent standards, show strongly varying traces of age or use, and were scanned using a wide array of scanners, cameras, and recording protocols over the past decades, this robustness of the model to different resolutions is a powerful property.

Exposing the network to images with lower amounts of applied ink (erosion) quickly leads to a noticeable drop in classification and OOD detection performance. A particularly interesting observation can be made for the input density, which barely drops with different sizes of the structuring element, indicating that the characteristics of under-inking still contribute significantly to the expectation. This is in contrast to the quickly decreasing maximum softmax score. Consequently, also in this scenario, an OOD score function based on the predictive part of the hybrid model is the better choice. The decrease in performance is less pronounced for the broadening of the characters (dilation), where the drop only occurs after reaching unrealistic levels of ink coverage. Compared to the erosion results, the energy scores increase by a significantly larger amount as the size of the structuring element grows. It can be concluded that over-inking plays a minor role in the present dataset. At the same time, the classification performance remains stable even at moderate dilation rates.

Figure 6 further reveals that even for very aggressive JPEG compression rates, both the OOD detection scores and the performance metrics hold up well. This is an important property since a considerable amount of scanned material is made available to the public in compressed JPEG format. Also, minor skewing of the scanned page with rotation angles of \(\pm 10^\circ \) still yields sufficient performance scores.

5.6 Qualitative analysis of class prototypes

In this section, we exploit the generative capabilities of our model for class-conditional sampling. Using the notion of probability mass (Eq. (9)), we seek to visualize image characteristics that play a significant role in optimizing the model distribution \(q_{\theta }({\textbf{x}})\). The generated class prototypes for different lower- and uppercase characters are shown in Fig. 7.

Fig. 7 Illustration of learned class prototypes for various lower-/uppercase graphemes. The prototypes were obtained by sampling the joint density according to Eq. (9) and selecting the ones with the highest probability mass (64 samples per class with 512 MCMC cycles). Since the samples are close to or within the typical set, they show characteristics with significant contributions to the modeled expectation (sampling was done with \(\text {JEM}\,+\,\text {MCS}\,+\,\text {EM}_{\text {Ligatures}}\))

In general, the considered classification task works on a rather standardized type design where variance is mostly contained in smaller local details. This variance includes different amounts of ink along the centerline of the character, age-related degradations (e. g., smears, smudges), deviations between different metal types due to wear, and various types of allographs and their defining typographic properties (e. g., contraction signs like the overline or the curved abbreviation stroke). These characteristics are well captured across several generated images. Moreover, for characters like ‘t’ with a curved abbreviation stroke (Latin abbreviation for -tur or, less often, -er), we observe recurring visual features like a small part of the adjacent letters, indicating that the model successfully captures semantics in the margins of the glyph images.

6 Discussion and conclusion

Contemporary OCR frameworks are a promising approach to extracting individual glyphs from a large corpus of digital reproductions of early printed documents. However, because they are designed for high-performance transcription, their ability to accurately localize and identify the constituting glyphs is limited. Unfortunately, rule-based detection methods are equally problematic as they often fail in the presence of artifacts introduced by the physicality of the printing process, age-related wear, or typographic decorations. With this in mind, it is desirable to prune any falsely recognized glyphs and correct for potential miscategorization.

To facilitate downstream finetuning, we use a joint energy-based model that allows the simultaneous classification of glyphs into a set of character classes and the separation between ID images and OOD samples often wrongly recognized by OCR frameworks. We show that this approach matches the classification performance of its purely discriminative counterpart and improves the OOD detection rate by a substantial margin without adding to model complexity.

One of the main ideas is to leverage prior knowledge of certain properties of OOD data that interact with our system. Using hard negative cases of rubrications, lombards, empty spaces, or very nuanced ligature crops, we can actively shape the energy surface and facilitate a more purposeful categorization of the glyphs. Fortunately, JEMs offer a remarkably simple way to steer and modulate the distribution estimate by introducing a penalty margin on the energy surface that spreads apart ID and OOD regions. We see that this not only helps to optimize the energy landscape but also leads to diffuse predictive probabilities for the OOD samples across all classes. Despite empirical evidence of the advantages of this measure, we also identify some fundamental difficulties. OOD samples might share most of their visual characteristics with ID samples. In our scenario, this is especially the case for ligatures, which mostly differ from the target stand-alone characters in very intricate linking patterns at the margins. From a theoretical perspective, this drastically increases the difficulty for the model to establish a sufficiently large difference in the energy scores. The model can only correct this by inducing very steep gradients into the model distribution, which might impact model stability or convergence of the optimization.

A second problem occurs if different types of OOD data are penalized using the same separating energy margin. As we have shown empirically, this leads to an overall decrease in most performance metrics. Similar to the difficulties mentioned above, different OOD data with potentially contrasting characteristics are encouraged to concentrate into narrower densities. In such scenarios, it might be viable to assign slightly different energies based on the visual similarity to the ID data.

As another valuable result, we observe that the classification performance and OOD detection rate are very robust to common image degradation. Given the partly generative nature of JEMs, the model can recover partial or corrupted information under the learned distribution. This property is especially important for the analysis of early printed material, which often contains a large number of artifacts. At the same time, the empirical ablation experiments are based on in-silico artifacts and, therefore, only allow a rough estimate of the actual generalizability to the many nuances in real historical prints.

6.1 Limitations

Currently, method evaluation is done on a single digital reproduction only. This might seem a manageable problem at first, as incunables comprise only a relatively small percentage of the total body of letterpress prints with moderate variance in fonts and typographic styles. However, their digital analysis must deal with a tremendous variety of scanner types, acquisition protocols, and states of conservation, which differ between libraries, editions, copies, and individual reproductions. It is, therefore, indispensable for a widely usable application to either train very specialized models or to use a single model operating as generally as possible. Furthermore, the proposed fine-tuning technique at its core implements a data-driven pruning algorithm to extract only the most probable images and discard implausible ones. If the downstream task requires (near-) full coverage, this rather aggressive approach might not be appropriate. In this case, optimization efforts should concentrate on building more robust OCR models, albeit requiring additional laborious transcription work. Lastly, during method development, we observed that larger CNNs like the Wide Residual Network [72] or the Vision Transformer [15] typically converge to odd glyph shapes. Similar behavior was already described by Nijkamp et al. [54] and Du et al. [19], confirming the generally known stability problems of EBMs [20]. With this in mind, we want to point out to the reader that there are also alternative options for training EBMs that could help circumvent these issues (cf. [27]).

6.2 Future work

Apart from exploring the method’s applicability to new types and compositions of data, we see strong merit in investigating different and possibly less ‘invasive’ techniques to integrate prior knowledge of OOD data. It would be particularly interesting to encourage the model to optimize toward a joint distribution of ID and OOD data without externally enforcing specific energy levels. For the application to early printed documents with often very localized variations in shape, this direction might prove beneficial. Furthermore, we want to investigate compositional patterns of specialized EBMs using different glyph concepts. As shown by Du et al. [16], we can use concept negation to learn a model distribution over data that does not include certain concepts, which might be viable for the goal of OOD detection.