Explainable dating of Greek papyri images

Pavlopoulos, John; Konstantinidou, Maria; Perdiki, Elpida; Marthot-Santaniello, Isabelle; Essler, Holger; Vardakas, Georgios; Likas, Aristidis

doi:10.1007/s10994-024-06589-w

Open access
Published: 11 July 2024

Volume 113, pages 6765–6786, (2024)

Journal: Machine Learning

Abstract

Greek literary papyri, which are unique witnesses of antique literature, do not usually bear a date. They are thus currently dated based on palaeographical methods, with broad approximations which often span more than a century. We created a dataset of 242 images of papyri written in “bookhand” scripts whose date can be securely assigned, and we used it to train algorithms for the task of dating, showing its challenging nature. To address data scarcity, we extended our dataset by segmenting each image into its respective text lines. By using the line-based version of our dataset, we trained a Convolutional Neural Network, equipped with a fragmentation-based augmentation strategy, and we achieved a mean absolute error of 54 years. The results improve further when the task is cast as a multi-class classification problem, predicting the century. Using our network, we computed precise date estimations for papyri whose date is disputed or vaguely defined, employing explainability to understand dating-driving features.

1 Introduction

No autographs of classical Greek authors survive today. Our knowledge of such works (along with post-classical literature and the first Christian works including the New Testament) relies on manuscripts postdating the original compositions. Of these, the most chronologically proximal are a few thousand papyri excavated mainly in Egypt in the last two centuries. Due to physical damage, these papyri usually preserve only small portions of the texts in question, unlike medieval manuscripts, which tend to transmit them in full, but both papyri and manuscripts represent copies of copies of the original works.

1.1 Background

Despite their fragmentary nature, papyri are crucial witnesses for innumerable texts, not to mention that they occasionally preserve literary works that would be otherwise lost. They are also invaluable evidence for our understanding of book culture in Antiquity, as well as for philology, the evolution of writing scripts and book production. One of the most important aspects of such research is to determine the date of the papyri involved.

Unlike their documentary counterparts (i.e. papyri preserving official and everyday documents), literary papyri bear no date before the introduction of colophons in the Middle Ages (9th century CE). We customarily employ palaeographical methods to assign an approximate and broad (often spanning more than a century) date for their production. Apart from their content, the two categories, documentary and literary papyri, are also usually written in distinctly different scripts: informal cursive writings for the former as opposed to elegant bookhands for the latter. There are some exceptions on both sides, i.e. literary texts written in cursive and documentary texts with surprisingly elegant scripts. To this day, we lack an exhaustive list of the first category (literary texts in cursive script), which prevents us from using the numerous dated documents to date these literary papyri by script comparison. However, a substantial number of specimens of the second group (documentary texts written in bookhand) have been included in the Collaborative Database of Dateable Greek Bookhands (CDDGB) along with literary texts (see below). Palaeographers rely on the evidence-backed assumption that handwriting styles are typical of certain periods and change over time, much like fashions and trends in anything else. The subjectivity and authoritativeness of these methods are increasingly acknowledged among scholars (Mazza, 2019; Choat, 2019; Nongbri, 2014, 2019), and assistance towards more reliable and/or accurate methods is highly desirable.

In traditional dating, papyrologists employ comparative dating. They use the (admittedly very few) objectively dateable papyri specimens to draw comparisons with non-dated ones and estimate the latter’s place on a notional timeline. The comparison is performed on the basis of the form and features of single letters, or the script overall, which are also used for other palaeographical tasks such as identifying scribes or classifying styles and types of scripts. The characteristics used for such studies may focus on size (small/large, short/long), shape (round/angular), specific parts of letters (arches/loops/serifs/decorations), speed of writing, ductus (the number, directions and sequence of strokes required to draw a letter), formality, etc. Although the same features are regularly invoked by many palaeographers, each researcher is free to focus (and they often do so) on every conceivable aspect of the writing. Hence, there is no formally established methodology, set of features to be taken into account, or even terminology that has managed to reach consensus (Stokes, 2009, 2017). Even for the commonly used and agreed upon features, it is rarely possible for scholars to measure them or objectively calculate their significance towards a conclusion. Research in digital palaeography quantifying script features such as angle and direction of writing [for instance (Brink et al., 2012; Bulacu et al., 2003)] usually provides one such feature as the basis for performing palaeographical tasks computationally. In our study, we aim at performing such a task (in this case dating) without any input in the form of human-perceived features. Instead, we attempt to identify any clues or features that lead our models to a specific date for a papyrus image.

Carbon dating is a destructive method and difficult to perform on any single papyrus (due to high costs, material damage, etc.), let alone on a large scale. Precision is also at stake, because more than one date span can be suggested (Sect. 8.2). Furthermore, it is a technique that dates the papyrus, not the text written on it, which may predate the manuscript by many years. Palaeographical methods, on the other hand, require a wide range of potential comparanda (easily accessible for consultation) to date a papyrus, and they are subjective. In contrast, computational image dating can be performed at scale in a non-destructive manner. It can also serve as an auxiliary tool for the expert, reducing subjectivity in palaeographical dating by avoiding the ad hoc use of script features. The computer can also pinpoint areas of the images that push predictions towards either extreme, and/or alter these images (and predict the corresponding date) in a controlled manner. Nevertheless, it cannot provide explanations in real-life terms, nor identify features perceivable by humans. At the same time, human experts instinctively date scripts in terms of certain characteristics, however subjective, but are unable to measure the significance of each such feature towards assigning a date. In this preliminary examination, our aim is to detect patterns (not necessarily semantically clear at this stage) in the application of saliency maps.

1.2 The contributions of this work

C1. We developed a dataset of images of Greek papyri from Egypt, along with the dates assigned to them by experts, which we release in two variants: one with whole papyri fragments; the other with lines of writing extracted from the full-size images.
C2. We proposed a Convolutional Neural Network (CNN), which we call fCNN, that is grounded in a fragmentation-based augmentation strategy and which predicts the date of text-line images with a mean absolute error of 54 years, using a regression head, and a macro-average F1 of 61.5%, using a classification head, setting the state of the art for Greek papyri image dating.
C3. We used fCNN to refine the dating of the lines of papyri whose previous dates are disputed or vaguely defined, and suggested using explainability as a method in order to explore dating-driving features. All our data, code, and predictions are publicly available: https://github.com/ipavlopoulos/palit.

2 Related work

Although researchers have suggested algorithms for the automated segmentation of papyri images into text lines (Papavassiliou et al., 2010), and although the benefits of text-line segmentation are already known in the field of writer identification (Cilia et al., 2021), no published work to date has investigated the computational dating of Greek literary papyri by focusing on text-line images. The baseline is set by a CNN that is fed with entire Greek literary papyri images and which achieved a mean absolute error of more than a century (Paparrigopoulou et al., 2023). Our study shows that data segmentation into text lines leads to a much smaller error, with augmentation-enhanced CNNs providing the best-performing solution. In the absence of other related work for Greek papyri image dating, we summarise in the following section the published work regarding dating in general (Omayio et al., 2022).

2.1 Image-based regression

CNNs outperform approaches based on feature engineering on writer identification (Nguyen et al., 2019; Fiel & Sablatnig, 2015) and similar findings are reported in dating. In Hamid et al. (2019), the authors used pre-trained CNNs to date images of medieval Dutch charters from the 14th to the 16th c. CE, by focusing on image crops. The authors reported a mean absolute error of 10 years, a number beyond our reach with papyrus data, where an approximation of 50 years is accepted. Regression using pre-trained CNNs on random crops was also suggested in Wahlberg et al. (2016), for the dating of medieval Swedish charters. Besides feature extraction with deep learning, earlier work approached the task with regression on top of extracted features, such as scale-invariant features (He et al., 2016) or hinge and fraglets features (He et al., 2014).

2.2 Dating from other modalities

Besides images, other modalities have also been used as input. In Kumar et al. (2012), for example, textual features were used to infer the date. Although reasonable in general, this is not a feasible approach for Greek literary papyri and manuscripts, the content of which (literary works) may have been composed several centuries before their actual copy on papyri (for instance, a papyrus penned in the 4th century CE that contains Homer’s work, probably composed around the 8th c. BCE). A different approach was suggested in Rahiche et al. (2020), where ordinal classification was combined with multispectral imaging, tracking spectral responses of iron-gall ink (of historical letters, 17th–20th c. CE) at different wavelengths. Although rich, this data representation is very expensive in time and resources to establish, especially for fragile material, which also explains why datasets in this form are very rare. Besides, papyri are mostly written with carbon-based and not iron-gall ink, which remains, to the present, more difficult to date.

3 Data

3.1 The nature of the papyri

As already mentioned, papyri bearing literary texts do not carry a date and, for the vast majority of them, papyrologists assign a date based on the affinity of their script with objectively (not palaeographically) dated specimens. These specimens, referred to as ‘objectively dated’ ones, are dated using external indications (not contained in the literary text on the papyrus) (Turner, 1987a). Occasionally, archaeological evidence or even radiocarbon dating suggests a more secure date. Most importantly, papyri were often re-used after they exceeded their lifespan and literary texts are often found on papyri that have dated documents on the back side.

3.2 Digitised papyri

The images included in our dataset come from a number of collections and online resources, whereas five or six of them were scanned from images in printed volumes. Their digitisation took place during a period of more than two decades, under substantially different imaging protocols. As a result, they vary greatly in their properties, most importantly in scaling to actual size, colour capturing, resolution and bit depth. For a few of them, it was not possible to extract text lines, due to very low resolution, and they returned empty files during the segmentation stage.

3.3 Our new dataset

Our dataset comprises images of Greek papyri from Egypt and their respective dates, from the 1st to the 4th c. CE.^{Footnote 1} Images of papyri from other centuries were few, hence we did not consider them in this study. The papyri included were selected (i.e., excluding non-Greek, documentary, and ones with an unavailable image) from CDDGB, the only available collection of (somewhat) securely dated literary papyri, which includes also a few documentary texts in bookhand. The data it contains can be dated based on various objective dating criteria, such as the presence of a document that contains a date on the reverse side, internal evidence in the text (mostly for the few documentary ones and the 9th c. CE manuscripts having colophons), radiocarbon dating, or a dateable archaeological context associated with the manuscripts. In the CDDGB database, most records contain sampled images and we had to manually trace full-sized ones from the respective collections. We release our dataset in two forms, one where images contain whole fragments and one where they contain text lines.

The Papyri Literary Fragments (PLF) dataset consisted initially of 242 images of publicly available papyri fragments from the 1st c. BCE to the 9th c. CE. As can be seen in Fig. 1, most fragments come from the 2nd or 3rd c. CE, followed by the 9th and the 1st c. CE. When multiple fragments of the same manuscript were available, we included all of them. The date provided for most fragments is not specific. Typically, the minimum date range assigned to a literary papyrus spans 50 years, but it may reach up to two centuries. Most often, the latter cases concern a date between the two most frequent centuries (i.e., ‘2–3’ in Fig. 1). Our study focused on the first four centuries, from the 1st to the 4th c. CE, comprising 168 images of literary papyri, reduced to 159 by removing nine empty images from the 2nd and 3rd c. CE. The final distribution across the four centuries (1st to 4th c. CE) was 20, 61, 60, and 18, respectively. We converted our images to grayscale to reduce dimensionality and to facilitate machine learning experiments. A potential bias regards dates from reuse (more than 40% of our data), as this method provides only one boundary of the interval whereas the other needs to be estimated.

The Papyri Literary Lines (PLL) dataset extends PLF so that images of the text lines of the fragments are provided instead of images of the whole fragments.^{Footnote 2} The 159 images were segmented automatically using the Transkribus HTR platform,^{Footnote 3} yielding 4,655 line images (i.e., approx. 30 lines per fragment). For this segmentation step, we used the default settings in Transkribus and did not train a specific baseline model, due to the multiformity of our material. We intervened minimally, by manually correcting text regions where none or very few lines were captured in the automatically generated segmentation. We also manually corrected a small number of baselines and line regions (approx. 1–2%), when no writing or an insignificant amount of it was captured, or when substantial and useful writing areas were obviously excluded. Even so, a considerable number of possibly useful lines were not added and, in several cases, the automatic segmentation captured multiple lines in one instance, or a substantial amount of background with minimal writing. Also, we did not eliminate lines with noise, such as damaged papyrus surface, gaps in the writing material (holes), and lines bordering the edge of the papyrus. As a result, several line images still contain noise, which means that the dataset would benefit from more interventional curation.

The balance of the centuries in PLL followed that of PLF, with 439, 2,116, 1,797, and 303 images from the 1st, 2nd, 3rd, and 4th c. CE, respectively. The scatterplot of PLL in Fig. 2b, shown alongside that of PLF in Fig. 2a for comparison purposes, shows that most of the images have a height of 50 pixels or more, and a diverse width, perhaps explained by the fact that lines comprise texts of various lengths, from a single word to more than ten. We filtered out low-quality images, with a height lower than 50 pixels or a width lower than 300 pixels, which resulted in 2,774 images in total (a 40% reduction). We release the annotations of the images, along with the URLs of the images, and a script to download the data.^{Footnote 4}
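The size filter described above can be sketched in a few lines. This is only an illustration, assuming each line image is summarised by a hypothetical `(name, height, width)` record; the two thresholds and the before/after counts are the ones stated in the text, everything else is made up for the example.

```python
MIN_HEIGHT = 50   # pixels, threshold stated in the text
MIN_WIDTH = 300   # pixels, threshold stated in the text

def keep_line(height, width):
    """Return True when a line image is large enough to be kept."""
    return height >= MIN_HEIGHT and width >= MIN_WIDTH

def filter_lines(records):
    """Keep only the (name, height, width) records that pass the filter."""
    return [rec for rec in records if keep_line(rec[1], rec[2])]

# Hypothetical records, for illustration only.
toy_records = [("a", 64, 512), ("b", 40, 800), ("c", 55, 120), ("d", 50, 300)]
kept = filter_lines(toy_records)

# Reduction implied by the counts in the text: 4,655 lines before, 2,774 after.
reduction = 1 - 2774 / 4655
```

Under these counts, `reduction` comes out at roughly 0.40, which matches the 40% figure quoted above.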

4 Method

Our method, called fCNN, is a 43M-parameter CNN that exploits augmentation so that it is robust to fragmented input, often met in papyri.

4.1 fCNN

The network consists of two Conv2D layers, of 32 and 64 channels respectively, which represent the image of each text line, followed by a 3-layer feed-forward neural network (FFNN). We used a convolutional kernel of size 5, single stride, zero padding, max-pooling (2×2), batch normalization, and ReLU activations. The FFNN receives a flat representation from the Conv2D layers, which is reduced to 1,024 and then to 512 neurons. One output neuron yields the date in fCNNr, whereas four output neurons followed by softmax yield the century in fCNNc. A ReLU activation function, batch normalization, and dropout of 0.1 are added per layer.
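As a sanity check on the stated model size, the parameter count can be worked out by hand. The sketch below is not the authors' code; it rests on several assumptions that the text does not spell out: a single-channel 50×300 input (suggested by the grayscale conversion, the 50×300 patches, and the 1:6 aspect ratio), "zero padding" read as a padding of 0, one convolution followed by 2×2 max-pooling per block, and two learnable parameters per batch-normalised unit.

```python
def conv_out(size, kernel=5, stride=1, padding=0):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(c_in, c_out, kernel=5):
    """Weights plus biases of a 2D convolution."""
    return c_in * c_out * kernel * kernel + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n):
    """Learnable scale and shift per unit or channel."""
    return 2 * n

height, width, channels = 50, 300, 1   # assumed input: grayscale, 50x300
total = 0
for c_out in (32, 64):                 # the two Conv2D layers
    total += conv_params(channels, c_out) + batchnorm_params(c_out)
    height = conv_out(height) // 2     # convolution, then 2x2 max-pooling
    width = conv_out(width) // 2
    channels = c_out

flat = channels * height * width       # size of the flattened representation
total += linear_params(flat, 1024) + batchnorm_params(1024)
total += linear_params(1024, 512) + batchnorm_params(512)
total += linear_params(512, 1)         # single output neuron (regression head)
```

Under these assumptions the flattened representation has 41,472 values and the total is about 43 million parameters, almost all of them in the first fully connected layer. The classification head with four outputs would add only a few thousand more.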

4.1.1 Synthetic fragmentation

Synthetic fragmentation is a possible augmentation channel during training. Papyri are very often fragmented, leading to partial information in the image to be dated. We exploited this pattern as part of our augmentation strategy by randomly erasing image fragments (with a probability of 0.5), setting their pixel values to 0.5. Images were also transformed with Gaussian blur (kernel size of 3) and random affine transformations (up to 3 degrees), and were cropped and resized (keeping the 1:6 aspect ratio).^{Footnote 5}
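The erasing step can be illustrated without any imaging library. The sketch below is a simplified stand-in, not the released implementation: the image is a plain list of rows with values in [0, 1], a single rectangle is erased, and the rectangle size (up to half of each dimension) is a hypothetical choice. Only the erasing probability of 0.5 and the fill value of 0.5 come from the text.

```python
import random

def erase_fragment(image, p=0.5, fill=0.5, rng=None):
    """With probability p, overwrite one random rectangle with `fill`."""
    rng = rng or random.Random()
    out = [row[:] for row in image]          # copy, leave the input untouched
    if rng.random() >= p:
        return out                           # no erasing for this sample
    height, width = len(out), len(out[0])
    # Hypothetical size rule: the rectangle covers up to half of each side.
    rect_h = rng.randint(1, max(1, height // 2))
    rect_w = rng.randint(1, max(1, width // 2))
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    for y in range(top, top + rect_h):
        for x in range(left, left + rect_w):
            out[y][x] = fill
    return out

toy_image = [[0.0] * 12 for _ in range(4)]   # tiny all-ink image
always = erase_fragment(toy_image, p=1.0, rng=random.Random(0))
never = erase_fragment(toy_image, p=0.0, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing runs.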

4.2 The baseline

We used state-of-the-art regression, which is achieved by ensembles (Fernández-Delgado et al., 2019), including Extremely Randomized Trees (XTR) and XGBoost (XGB). We experimented with both these regressors, using patches of 50×300 pixels cropped from the center of each image; each patch was then represented with PCA-extracted 500-dimensional features. In our preliminary experiments, PCA led to better results compared to image binarisation using Canny edge detection and Otsu thresholding, which have been reported as beneficial in writer identification (Nasir & Siddiqi, 2021). We used the implementation provided by sklearn, setting all hyper-parameters to default values, besides the objective of XGB, which was set to the squared error.
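The central crop that feeds the baselines can be sketched as follows. This is an illustration under the assumption that an image is a list of pixel rows; the 50×300 window is the one named in the text, while the function name and the toy image are hypothetical. The PCA step is not reproduced here.

```python
def center_crop(image, crop_h=50, crop_w=300):
    """Return the crop_h x crop_w window around the centre of the image."""
    height, width = len(image), len(image[0])
    if height < crop_h or width < crop_w:
        raise ValueError("image is smaller than the requested window")
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# Toy image whose "pixels" record their own (row, column) position.
toy_image = [[(y, x) for x in range(400)] for y in range(60)]
patch = center_crop(toy_image)
```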

5 Experiments

We experimented on the PLL dataset, using as input the text-line images and as output the dates of the respective papyri. Our experiments include casting the problem as a classification task, by predicting the century as one out of four labels.

5.1 Experimental details

We used Adam optimisation (Kingma & Ba, 2014) with a learning rate of 1e-3, a batch size of 16, 200 epochs, and early stopping with a patience of 20 epochs. We used 2,218 instances for training, 278 for testing, and 278 for validation. The regression variant was trained with a mean squared error loss and the classification variant with a cross-entropy loss. We used PyTorch and we released our code in our repository.^{Footnote 6}
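The early-stopping rule can be made concrete with a small sketch. It assumes the common reading of "patience": training stops once the validation loss has not improved for 20 consecutive epochs. The function name and the loss curve are hypothetical; only the epoch budget of 200 and the patience of 20 come from the text.

```python
def early_stopping_epochs(val_losses, max_epochs=200, patience=20):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: reset the counter
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` epochs
    return best_epoch, last_epoch

# Hypothetical curve: the loss improves for 10 epochs, then plateaus.
toy_losses = [1.0 - 0.05 * e for e in range(10)] + [0.6] * 100
best, stopped = early_stopping_epochs(toy_losses)
```

In this toy run the best epoch is the tenth one (index 9) and training halts 20 epochs later, well before the 200-epoch budget is used up.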

5.2 The benchmark

A majority baseline (BLM), which always predicts 2 CE, achieved an MAE of approx. 0.632 and an MSE of 0.772. XTR and XGB perform better than this weak baseline, with a considerable difference when looking at MSE. The latter penalizes greater distances more, which means that the papyri of the 1st and 4th c. CE were better handled by XGB and XTR. Our fCNNr performs considerably better than all the baselines, achieving an average absolute error of 54 years.
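To make the two metrics and the majority baseline concrete, here is a minimal sketch. The labels are hypothetical and expressed in centuries, which is an assumption suggested by the baseline scores being below one while the fCNNr error is quoted in years (an error of 0.54 centuries would correspond to 54 years).

```python
from collections import Counter

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error; large deviations weigh more than in MAE."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_label(labels):
    """Most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical century labels, for illustration only.
toy_train = [1, 2, 2, 2, 3, 3, 4]
toy_test = [1, 2, 3, 3, 4]

constant = majority_label(toy_train)
predictions = [constant] * len(toy_test)
baseline_mae = mae(toy_test, predictions)
baseline_mse = mse(toy_test, predictions)
```

In the toy data the only 4th-century item contributes 2 to the absolute error but 4 to the squared error, which is why MSE is the more sensitive metric for the two edge centuries.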

Table 1 Mean absolute and squared error of dating along with their standard error of the mean in parenthesis

5.3 From regression to classification

By rounding the predictions of our fCNNr, we created a confusion matrix, which is shown in Fig. 3a. Confusion regards mainly neighboring centuries. The model correctly detects images from 2 CE and 3 CE while images from 3 CE may be predicted close to 2 CE, and vice versa. Difficulties in dating regard the two edges because 1 CE and 4 CE are more often predicted as 2 CE and 3 CE, respectively.
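The rounding step and the resulting confusion matrix can be sketched as below. The predictions are hypothetical and assumed to be expressed in centuries; clipping to the range 1 to 4 is also an assumption, added so that every prediction maps to one of the four labels.

```python
def to_century(prediction, lowest=1, highest=4):
    """Round a continuous prediction to the nearest century, clipped to the label range."""
    return min(highest, max(lowest, int(prediction + 0.5)))

def confusion_matrix(y_true, y_pred, labels=(1, 2, 3, 4)):
    """Rows are true centuries, columns are predicted centuries."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical ground truth and raw regression outputs.
toy_truth = [1, 2, 2, 3, 3, 4]
toy_raw = [1.7, 2.1, 2.6, 2.4, 3.2, 3.3]

rounded = [to_century(p) for p in toy_raw]
matrix = confusion_matrix(toy_truth, rounded)
```

The toy values were chosen to mimic the pattern described above: the edge centuries drift inwards, and the two middle centuries are confused with each other.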

Although our task at hand is a regression one in nature, we also trained and assessed a classification variant (fCNNc), which disregards the order of the centuries and simply treats them as labels. In Fig. 3b, we observe that results improve across all centuries except 4 CE, where the difficulty remains approximately the same. This observation can also be seen in Table 2, which shows the F1 per century per fCNN variant. In this table, we also observe that a trained XGB classifier is better than an XTR one, but both fall short of fCNNc.
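For reference, the per-century F1 and its macro average can be computed as in the following sketch. It uses the standard definition F1 = 2TP / (2TP + FP + FN); the labels and predictions are hypothetical and do not reproduce any number from the tables.

```python
def f1_per_class(y_true, y_pred, labels=(1, 2, 3, 4)):
    """F1 score of each label, treating it as the positive class in turn."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores

def macro_f1(scores):
    """Unweighted mean over classes, so rare centuries count as much as frequent ones."""
    return sum(scores.values()) / len(scores)

# Hypothetical century labels and predictions.
toy_truth = [1, 2, 2, 3, 3, 4]
toy_pred = [2, 2, 3, 2, 3, 3]

per_class = f1_per_class(toy_truth, toy_pred)
macro = macro_f1(per_class)
```

Because the macro average weighs each century equally, weak results on the sparsely populated 1st and 4th centuries pull it down more than they would affect plain accuracy.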

Table 2 F1 per century of fCNNr (rounded predictions), fCNNc (along with the absolute difference compared to fCNNr), XGB, and XTR classifier

Although both fCNN variants are trained on the same data, we do not consider them as competitors. The regression-based fCNNr suggests a date, which can provide a very rough estimation of when the papyrus was written. If the predicted date was 280 CE, then this is an indication that the papyrus is dated between 3 CE and 4 CE, and that a year close to the latter is likelier. On the other hand, the classification-based fCNNc suggests a century and yields a score to indicate its confidence. If the predicted century was the 4th c. CE and the confidence was 80%, then this means that the network is confident that the date is the 4th c. CE and no other. Although our task at hand is one of regression, both can generate useful explanations. Therefore, since our end goal is to assist and not replace the expert, we used them both in our explainability study, discussed next.
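The two kinds of output can be contrasted with a short sketch. The logits are hypothetical, and treating the maximum softmax probability as the confidence score is an assumption about how the score is read off the classification head.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def year_to_century(year):
    """Map a CE year to its century, e.g. 201 to 300 map to the 3rd."""
    return (year - 1) // 100 + 1

# Regression-style output: a year, read as a position within a century.
century_of_280 = year_to_century(280)

# Classification-style output: one hypothetical logit per century (1st to 4th).
toy_logits = [0.1, 0.3, 1.0, 2.8]
probabilities = softmax(toy_logits)
confidence = max(probabilities)
predicted_century = probabilities.index(confidence) + 1
```

A regression output of 280 falls in the 3rd century but close to its end, whereas the classification output commits to a single century and attaches a probability to it.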

5.4 Explainability

Saliency maps (Simonyan et al., 2013) reveal the parts of the image that are responsible for the network’s prediction. We experimented with both variants, fCNNr and fCNNc, and we used both gradient- and perturbation-based attribution. In this study, we opted for fCNNc using gradient-based attribution, but we observe that explanations by the two variants can be combined to yield richer explanations.

We computed one heatmap per predicted line and we present a random sample of lines in Fig. 4. The heated colours show that the network consistently focuses on the letters in order to yield its predictions for the date. This means that the model is basing its prediction on the shape of specific letters, the distance between them, the size, or the intensity of the ink. By contrast, it seems invariant to background noise and other attributes which may often be present in Greek literary papyri. For example, gaps (holes in the papyrus), such as those in Figs. 4a and c, do not get any attention from the model.
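Gradient-based attribution needs an automatic differentiation framework, so the sketch below illustrates the perturbation-based alternative instead, which only requires calling the model. It is a generic occlusion procedure, not the authors' implementation: `toy_score` is a hypothetical stand-in for a trained network, and the patch size and fill value are arbitrary.

```python
def occlusion_saliency(image, score_fn, patch=2, fill=0.5):
    """Heat each patch by how much occluding it changes the model score."""
    height, width = len(image), len(image[0])
    base = score_fn(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]
            for y in rows:
                for x in cols:
                    occluded[y][x] = fill        # hide this patch
            change = abs(base - score_fn(occluded))
            for y in rows:
                for x in cols:
                    heat[y][x] = change          # large change means important patch
    return heat

def toy_score(image):
    """Hypothetical model: mean darkness of the left half of the image."""
    half = len(image[0]) // 2
    values = [1.0 - v for row in image for v in row[:half]]
    return sum(values) / len(values)

# Ink (0.0) on the left half, blank background (1.0) on the right half.
toy_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
heat = occlusion_saliency(toy_image, toy_score)
```

Patches that the stand-in model ignores receive zero heat, which is the same kind of evidence the text relies on when it notes that holes in the papyrus attract no attention.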

6 Assessing data sources limitations

6.1 Inaccuracies in dating

CDDGB is not a product of targeted research on securely dated papyri, but rather a compilation of such examples mentioned in other papyrological works.^{Footnote 7} Hence, the collection is not comprehensive and the data included is not meticulously assessed by the compilers. Shortcomings concern the accuracy of some dates. Even so, it is the same data of objectively dated papyri that papyrologists use as a reference for palaeographical dating. In this study, we introduce the computational factor in assessing scripts in connection with their assigned dates. Also, because we focus on the explainability of dating images of handwritten text, we do not consider these shortcomings detrimental. The possible inaccuracies in dating and the wide range of the assigned dates do not affect the explanations, which aim to provide pointers on features of the script.

6.2 Class imbalance

The imbalance in the size of the fragments and the quantity of lines is an inherent issue owing to the nature of the available material. One papyrus may contain three or four usable lines, whereas others may have more than fifty. This does not affect dating significantly because, although test lines may come from a manuscript that was seen during training, each line constitutes a completely different image pattern. The same issue could be an advantage regarding explainability, because possible features are brought out in a more controlled manner when multiple lines of the same manuscript are involved. While some features, especially palaeographically insignificant characteristics, remain consistent (such as colour/intensity of the ink, texture and colour of the background, general size of script, scale, etc.), explanations can focus on pivotal ones.

6.3 Fragment leakage

Our train and validation subsets are mutually exclusive at the line level but not necessarily at the fragment level. Although the former is straightforward to guarantee, the latter is not, due to the diversity of lines in the fragments. Figure 5 shows this diversity, as well as the fact that most fragments comprise only a few lines. In other words, fCNN may have learned to detect the fragment instead of the century of the line. Although this would potentially be an interesting finding, it is not within the scope of this study. Hence, we experimented with two different dataset split strategies, in order to reject this hypothesis (the number of lines per dataset per split is shown in Table 4 of Appendix A).

6.3.1 The Modulo-13 split

For validation and testing, we held out the lines from papyri whose index is divisible by 13 (i.e., leaves zero remainder), split in half between the two subsets, and kept the rest for training. This strategy introduces a slight distribution drift, with relatively fewer lines from 1 CE during testing and more from 4 CE.
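A papyrus-level split of this kind might look as follows. The sketch assumes that each line carries the integer index of its papyrus, and that the held-out papyri are alternated between validation and test; the text only says they are divided in half, so the alternation is a hypothetical choice.

```python
def modulo_split(lines, modulus=13):
    """Split (papyrus_index, line_id) pairs so that no papyrus spans two subsets."""
    held_out = sorted({index for index, _ in lines if index % modulus == 0})
    val_ids = set(held_out[0::2])    # assumption: alternate the held-out papyri
    test_ids = set(held_out[1::2])
    train = [line for line in lines if line[0] % modulus != 0]
    val = [line for line in lines if line[0] in val_ids]
    test = [line for line in lines if line[0] in test_ids]
    return train, val, test

# Hypothetical data: 40 papyri with three lines each.
toy_lines = [(index, f"line-{index}-{k}") for index in range(40) for k in range(3)]
train, val, test = modulo_split(toy_lines)
```

Splitting on the papyrus index rather than on individual lines is what removes the leakage: every line of a held-out papyrus is unseen during training.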

6.3.2 The few-lines split

We sorted the fragments based on the number of lines of each fragment. Then, we used the lines from the ones with the fewest lines (the rightmost ones in Fig. 5) to form the test set. We followed a stratified sampling based on the distribution of the centuries in the training data. For validation, we used fragments with few lines, one per century, combined with instances sampled from the training data to avoid a distribution drift at the century level and to assist the learning process.
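The core of this strategy, sending the shortest fragments to the test set, can be sketched as below. The sketch is deliberately partial: it leaves out the stratification by century and the construction of the validation set, and the function name, the `n_test_fragments` parameter, and the data are hypothetical.

```python
from collections import defaultdict

def few_lines_split(lines, n_test_fragments):
    """Put the fragments with the fewest lines into the test set."""
    by_fragment = defaultdict(list)
    for fragment_id, line_id in lines:
        by_fragment[fragment_id].append(line_id)
    # Sort by line count; ties are broken by fragment id for determinism.
    ordered = sorted(by_fragment, key=lambda f: (len(by_fragment[f]), f))
    test_fragments = set(ordered[:n_test_fragments])
    train = [line for line in lines if line[0] not in test_fragments]
    test = [line for line in lines if line[0] in test_fragments]
    return train, test

# Hypothetical fragments with 5, 1, 2 and 8 lines.
sizes = {"A": 5, "B": 1, "C": 2, "D": 8}
toy_lines = [(frag, f"{frag}-{k}") for frag, n in sizes.items() for k in range(n)]
train, test = few_lines_split(toy_lines, n_test_fragments=2)
```

Holding out short fragments costs few training lines while still testing on papyri the model has never seen.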

6.3.3 Empirical analysis

The F1 scores per century of fCNNc, when it was re-trained and assessed according to the two new split strategies, are shown in Table 3. Both strategies have merits compared to the default, with modulo-13 being better than both few-lines and the default strategy. Based on these results, we reject the hypothesis that fragment leakage could drive the promising results presented in this study.

Table 3 F1 per century of fCNNc per split

7 Error analysis

To go further in our understanding of the relevance of our experiment, we provide in this section an error analysis, followed by an experiment to understand the way the model handles damaged papyri by ablating input images before dating.

7.1 Analysis

By studying fCNN’s deviations from the ground truth, we observed that these mostly concerned predictions toward the neighbouring century. Images from the two edge centuries, 1 CE and 4 CE, are scored up to 2 CE and 3 CE, respectively, the two most frequent centuries (Fig. 1). Images from 2 CE and 3 CE, on the other hand, were scored not far from each other, most often to 3 CE and 2 CE, respectively. By looking at the saliency maps of the misclassifications, we observed that letter-shaped noise, present in the source images, received the model’s focus.

7.2 Ablation

Our error analysis revealed that fragments may deceive our model. In order to investigate the model’s sensitivity, we fed fCNNc with test images augmented with randomly-shaped black and white patches. We observe that the model’s focus changes according to the colour of the patch. White boxes appear to be disregarded by our model, in contrast to black boxes, which receive attention. An example is shown in Fig. 6, where the same line from a papyrus of the 3rd c. CE is altered in two ways. In Fig. 6b, the focus is everywhere except on the white patch. This is in line with our findings about breaks, which are also depicted in white in the papyri images (Fig. 4). By contrast, the black patch of Fig. 6a affects the prediction as if the model were guessing a missing character and as if the black colour of the patch were ink.

7.2.1 White patches

The addition of white patches on the images harms the performance of fCNNc. The F1 dropped for each class, yielding a drop in macro-averaged F1 of 11 points overall (from 0.58 to 0.47).^{Footnote 8} By hiding parts of the image, important information is lost. In other words, fragments hinder the model in its prediction task.

7.2.2 Black patches

The performance dropped significantly more when the patch was black (from 0.58 to 0.22 in macro-F1, using the same model). Although an equal portion of information was lost from the image, compared to the white patches, in this case, the model attempted to fill the black gap, guessing the missing context. This was shown in Fig. 6a but can be more clearly observed in the examples of Fig. 7, where each black patch is annotated with a red plate. Outside the patches, high temperature appears on pixels with ink and handwriting. Inside the patch, however, letter-looking structures are revealed in black (with low temperature) on a high-temperature background, as if the network focused on the letter’s contour and the space between the (guessed) letters. This guessing pushed fCNN to predict more often the 3rd c. CE, as if the reconstructed part of the image (the black patch) is common in papyri of that century, steering accordingly the prediction.^{Footnote 9}

8 Dates in doubt: A computational estimate

fCNN can accurately predict the date of a text line image (Table 1) and, when the task is simply to predict the century and not an exact date, a classification variant that ignores the temporal relation of the labels yields even better results (Table 2). As was shown from our study of saliency-based explanations, fCNN focuses on the letters, that is the foreground and not the background (e.g., the blank parts of the papyrus sheet, the fibres, the holes and damages). In order to provide the experts with suggestions that could possibly improve the current dating,^{Footnote 10} we apply this network to loosely dated texts (across two centuries).

In our primary source, 11 papyri are dated either to 2 CE or 3 CE. Using fCNNc, we found that 87% of all the lines in these papyri are classified as 2 CE or 3 CE. Exceptions are 16 lines classified as the 1st c. CE and one that was classified as 4 CE. Figure 8 presents the analytical results.

Using fCNNr, we then attempted to estimate a more precise chronology for the lines in these papyri. Despite the fact that our regressor was trained on ground truth at the century level, our expectation is that it will have learned to yield a chronology that is closer to the objective date. Figure 9 presents our predictions, organised per papyrus. The predicted dates for the lines of P.Oxy. 3005, classified by fCNNc as 3 CE, are diverse, with the majority falling in the late 2nd and early 3rd c. CE. Overall, our network’s estimations agree with the range provided by the experts. The earliest prediction was 98 CE, for a line in P.Oxy. 661. This papyrus comprises parts of a poem by Callimachus and is dated from 150 to 250 CE,^{Footnote 11} with the first editor arguing that it is from the late 2nd c. CE.^{Footnote 12} On average, our predictions suggest 200 CE, but some lines are predicted as early as 100 CE while others as late as 250 CE. The latest prediction is 270 CE for a line in P.Flor. II 120,^{Footnote 13} dated from 250 CE to 261 CE. In this papyrus, our predictions agree with the experts in very few lines, because on average our network dates it before 200 CE. In P.Oxy. 4560, only one line is used, and the predicted date is 100 CE. In P.Oxy. 232, although lines are few, all our predictions date the papyrus between 100 CE and 150 CE.

8.1 P.Oxy. 852: taking a stand

The first editors dated the fragments of Euripides’ Hypsipyle (P.Oxy. 852) between the late 2nd and early 3rd c. CE.^{Footnote 14} The literary text is written on the back of a reused scroll, containing an account that may be dated to 106 CE.^{Footnote 15} How long papyrus scrolls were preserved before being reused is subject to discussion, and examples range from a couple of years to more than a century (Turner, 1978b). Our fCNN models agree with the editors, favouring the late 2nd over the early 3rd c. CE.

8.2 P. Bodmer XXIV: radiocarbon versus palaeographic grounds

Based on radiocarbon dating, the resulting ranges of calendar dates for P. Bodmer XXIV were found to be 231-261 CE or 277-339 CE. These ranges are consistent with the palaeographic estimates of Kasser/Testuz and Turner, who proposed the 3rd or 4th c. CE (Stevens, 2023), but they cast doubt on the 2nd c. date assigned by Roberts.

We segmented the images of the papyrus into text lines, 170 in total.^{Footnote 16} We discarded 25 lines because of their small size,^{Footnote 17} and fed the resulting 145 images to fCNNc and fCNNr. The former predicted 121 of the 145 lines as 2nd c. CE, 13 as 1st and 11 as 3rd. The predictions of fCNNr, shown in Fig. 10, agree with this outcome. Hence, our fCNN agrees with both Roberts (more frequently) and Kasser/Testuz and Turner, probably picking up various characteristics of the script and script variations that evolved through the centuries before becoming the norm. We study these characteristics further below, leveraging explainability as a method.
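The filtering and tallying steps described above can be sketched as follows. This is a hedged illustration rather than the released implementation: the helper names `keep_line` and `tally_centuries` are hypothetical, and the thresholds are assumed to be in pixels (the footnote gives the numbers 300 and 50 without a unit). The line counts 121, 13 and 11 are those reported in the text.

```python
from collections import Counter

# Thresholds from the footnote on discarded lines; the unit (pixels) is an assumption.
MIN_WIDTH, MIN_HEIGHT = 300, 50


def keep_line(width, height):
    """Return True if a line image is large enough to be dated."""
    return width >= MIN_WIDTH and height >= MIN_HEIGHT


def tally_centuries(line_predictions):
    """Count predicted centuries and return (counts, majority_century)."""
    counts = Counter(line_predictions)
    majority, _ = counts.most_common(1)[0]
    return counts, majority


# Line-level outcome reported above: 121 lines -> 2nd c., 13 -> 1st c., 11 -> 3rd c.
predictions = [2] * 121 + [1] * 13 + [3] * 11
counts, majority = tally_centuries(predictions)
share = counts[majority] / len(predictions)
print(majority, round(share, 3))
```

A simple majority vote is only one way to move from line-level to papyrus-level dating. The text reports the line-level distribution itself, which also shows how strongly the minority centuries are represented.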

8.3 XaM: explainability as a method

A significant portion of our training data carries date labels that rely to some degree on human estimations derived from palaeography. Explanations of a model trained on these data are therefore likely to reflect the features that contributed most to the date predictions, as perceived by both the model and the humans involved. To reveal these features, an accurate date prediction becomes less relevant and the models are repurposed, setting the lens to focus on the features that drive the predictions instead of the predictions themselves. We call this approach explainability as a method (XaM) and discuss it further below.

8.3.1 Ablating the input: disegni

Disegni are manual script reproductions established by drawing the papyrus letters on tracing paper with various writing implements, also achieving optimal binarization. Figure 11b shows a disegno we produced for P.Bodmer XXIV, originally shown in Fig. 11a, with ink splotches naturally being formed where the pen rests for a while. The disegno of Fig. 11c attempts to reduce these splotches by not resting the pen, while that of Fig. 11d further eliminates areas of high ink concentration by avoiding “letter shading”, i.e., variation in thickness or darkness of strokes within handwritten letters, due to differences in pressure applied by the writer’s hand, the angle of the writing implement, or the type of writing instrument used. Although we focused on P.Bodmer XXIV (Sect. 8.2) here, we note that our XaM-based approach is applicable to other use cases, scripts, and languages given data availability (Dhali et al., 2020; Boudraa et al., 2024).

8.3.2 Driving features of date predictions

Fig. 12 shows the century probabilities of fCNNc (using softmax) for the original image and the disegni shown in Fig. 11. The prediction is 2nd c. CE for both the original image and the disegno with full shading. However, as shading is reduced, the prediction of the network gradually moves from 2nd c. CE to 3rd c. CE, a result that agrees with the radiocarbon dating of P.Bodmer XXIV (Sect. 8.2). The saliency maps of fCNNc (shown in Fig. 13 of Appendix C) show that explanations focus on greater concentrations of ink, with a strong preference for the intersections of strokes, as well as for the curvature of some others. These initial findings indicate that experimenting with controlled alterations of line images could produce more interpretable explanations and reveal the script features that are decisive for dating. We discuss XaM further in Appendix C.
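To make the softmax step concrete, the sketch below turns raw class scores into century probabilities and reads off the most probable century. It is an illustration under stated assumptions: the logits are invented for this example and are not outputs of fCNNc, the four-class label set is assumed, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_century(logits, centuries):
    """Return the most probable century label and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return centuries[best], dict(zip(centuries, probs))


# Assumed label set and hypothetical logits (NOT taken from the paper).
centuries = ["1st", "2nd", "3rd", "4th"]
full_shading = [0.1, 2.0, 1.2, -0.5]
no_shading = [0.0, 1.1, 1.9, -0.2]

print(predicted_century(full_shading, centuries)[0])
print(predicted_century(no_shading, centuries)[0])
```

Comparing the full distributions for each disegno, rather than only the top label, is what allows a gradual shift between centuries to be observed as shading is reduced.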

9 Conclusions

This work introduced a dataset of images of Greek literary papyri with whole papyri fragments (PLF) and with lines of writing (PLL). Our experiments showed that an augmentation-enhanced CNN predicts the date of text-line images with a mean absolute error of 54 years, using a regression head, and a macro-average F1 of 61.5%, using a classification head, setting the state of the art for Greek papyri image dating. An explainability study revealed that fCNN clearly focuses on letters to predict the date, following the palaeographer’s path. Also, by masking the images prior to dating, we showed that black patches (as in ink) affected the model’s performance considerably more than white patches (as in writing surface) and that fCNN explanations on these black patches appear as letter-looking structures. Finally, we applied fCNN to two papyri whose dates have been debated among palaeographers, taking a stand and showing that dating-driving features can be revealed by leveraging explainability as a method. To this end, we used manually altered versions of the images in question, yielding improved dating predictions and more interpretable explanations.

Data availability statement

We provide a script to download the data in our repository. Permission to use the images of Sect. 8.3 has been granted by the Museum of the Bible.

Notes

1. There is a very small number of exceptions which reflect the complexity of our documentation: one text is in Coptic, a few do not come from Egypt but the Near East and another few are written on parchment, not papyrus. In this study, we collectively call them ‘papyri’.
2. Alternative options exist (Appendix B), but we consider that segmentation to lines best fits our task.
3. https://readcoop.eu/transkribus/.
4. https://github.com/ipavlopoulos/palit.
5. The actual letter size, image ratio, and lighting conditions (when papyri were captured) vary greatly in our dataset, which is what drove our decisions on augmentation strategies to enhance model robustness.
6. https://github.com/ipavlopoulos/palit.
7. More reliable compilations are promised by current projects, but are still only work-in-progress.
8. We report the results for the model that is based on the few-lines strategy.
9. The Recall for the 3rd c. CE increased from 0.71 to 0.88 while Precision decreased from 0.67 to 0.48. No predictions were made for the 1st and 4th c. CE.
10. Datings usually come from the first editor of the text. Sometimes, another expert (editor) makes a case that the dating should be modified and the correction may be accepted or provided as alternative dating.
11. https://www.trismegistos.org/text/59375 (accessed: May 25, 2023).
12. The Photographic Archive of Papyri in the Cairo Museum (accessed: May 25, 2023).
13. https://papyri.info/ddbdp/p.flor;2;120 (accessed: May 25, 2023).
14. https://digital.bodleian.ox.ac.uk (accessed: January 9, 2024).
15. https://papyri.info/ddbdp/sb;20;14409 (accessed: December 27, 2023).
16. We used Transkribus, as described in Sect. 3.3.
17. Images of lines with a width of less than 300 or a height of less than 50.

References

Boudraa, M., Bennour, A., Al-Sarem, M., Ghabban, F., & Bakhsh, O. A. (2024). Contribution to historical manuscript dating: A hybrid approach employing hand-crafted features with vision transformers. Digital Signal Processing, 104477.
Brink, A. A., Smit, J., Bulacu, M. L., & Schomaker, L. R. B. (2012). Writer identification using directional ink-trace width measurements. Pattern Recognition, 45(1), 162–171. https://doi.org/10.1016/j.patcog.2011.07.005
Bulacu, M., Schomaker, L., & Vuurpijl, L. (2003). Writer identification using edge-based directional features. In: ICDAR, Edinburgh (pp. 937–941). https://doi.org/10.1109/ICDAR.2003.1227797
Choat, M. (2019). Dating Papyri: Familiarity, instinct and guesswork. Journal for the Study of the New Testament, 42(1), 58–83. https://doi.org/10.1177/0142064X19855580
Cilia, N. D., De Stefano, C., Fontanella, F., Marthot-Santaniello, I., & Freca, A. (2021). Papyrow: A dataset of row images from ancient greek papyri for writers identification. In ICPR, Virtual (pp. 223–234). Springer.
Dhali, M. A., Jansen, C. N., De Wit, J. W., & Schomaker, L. (2020). Feature-extraction methods for historical manuscript dating based on writing style development. Pattern Recognition Letters, 131, 413–420.
Fernández-Delgado, M., Sirsat, M. S., Cernadas, E., Alawadi, S., Barro, S., & Febrero-Bande, M. (2019). An extensive experimental survey of regression methods. Neural Networks, 111, 11–34.
Fiel, S., & Sablatnig, R. (2015). Writer identification and retrieval using a convolutional neural network. In CAIP, Valletta, Malta (pp. 26–37). Springer.
Hamid, A., Bibi, M., Moetesum, M., & Siddiqi, I. (2019). Deep learning based approach for historical manuscript dating. In ICDAR, Sydney, Australia (pp. 967–972). IEEE.
He, S., Samara, P., Burgers, J., & Schomaker, L. (2016). Discovering visual element evolutions for historical document dating. In ICFHR, Shenzhen, China (pp. 7–12). IEEE.
He, S., Samara, P., Burgers, J., & Schomaker, L. (2014). Towards style-based dating of historical documents. In ICFHR, Crete, Greece (pp. 265–270). IEEE.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980
Kumar, A., Baldridge, J., Lease, M., & Ghosh, J. (2012). Dating texts without explicit temporal cues. arXiv:1211.2290
Mazza, R. (2019). Dating early Christian papyri: Old and new methods—introduction. Journal for the Study of the New Testament, 42(1), 46–57. https://doi.org/10.1177/0142064X19855579
Nasir, S., & Siddiqi, I. (2021). Learning features for writer identification from handwriting on papyri. In Pattern recognition and artificial intelligence, Hammamet, Tunisia (pp. 229–241). Springer.
Nguyen, H. T., Nguyen, C. T., Ino, T., Indurkhya, B., & Nakagawa, M. (2019). Text-independent writer identification using convolutional neural network. Pattern Recognition Letters, 121, 104–112.
Nongbri, B. (2014). The limits of palaeographic dating of literary papyri: Some observations on the date and provenance of P. Bodmer II (P66). Museum Helveticum, 71(1), 1–35.
Nongbri, B. (2019). Palaeographic analysis of codices from the early Christian period: A point of method. Journal for the Study of the New Testament, 42(1), 84–97. https://doi.org/10.1177/0142064X19855582
Omayio, E. O., Indu, S., & Panda, J. (2022). Historical manuscript dating: traditional and current trends. Multimedia Tools and Applications, 81(22), 31573–31602.
Paparrigopoulou, A., Kougia, V., Konstantinidou, M., & Pavlopoulos, J. (2023). Greek literary papyri dating benchmark. https://doi.org/10.21203/rs.3.rs-2272076
Papavassiliou, V., Stafylakis, T., Katsouros, V., & Carayannis, G. (2010). Handwritten document image segmentation into text lines and words. Pattern Recognition, 43(1), 369–377.
Rahiche, A., Hedjam, R., Al-Maadeed, S., & Cheriet, M. (2020). Historical documents dating using multispectral imaging and ordinal classification. Journal of Cultural Heritage, 45, 71–80.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034
Stevens, D. (2023). Radiocarbon analysis of six Museum of the Bible manuscripts. Zeitschrift für Papyrologie und Epigraphik, 227, 153–160.
Stokes, P. A. (2009). Computer-aided palaeography, present and future. In: Rehbein, M., Schaßan, T., Sahle, P. (Eds.), Kodikologie und Paläographie Im Digitalen Zeitalter - Codicology and Palaeography in the Digital Age (Vol. 2, pp. 309–338). BoD, Norderstedt. https://kups.ub.uni-koeln.de/2978/
Stokes, P. A. (2017). Scribal attribution across multiple scripts: A digitally aided approach. Speculum, 92(S1), 65–85. https://doi.org/10.1086/693968
Turner, E. G. (1987a). Greek manuscripts of the ancient world (P. J. Parsons, Ed.; Revised and Enlarged ed.). Institute of Classical Studies, London.
Turner, E. G. (1978b). Writing materials for businessmen. Bulletin of the American Society of Papyrologists, 15(1), 163–169.
Wahlberg, F., Wilkinson, T., & Brun, A. (2016). Historical manuscript production date estimation using deep convolutional neural networks. In ICFHR, Shenzhen, China (pp. 205–210). IEEE.


Funding

Open access funding provided by Stockholm University. This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.

Author information

Authors and Affiliations

Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden
John Pavlopoulos
Athens University of Economics and Business, Athens, Greece
John Pavlopoulos
Department of Greek Philology, Democritus University of Thrace, Komotini, Greece
Maria Konstantinidou & Elpida Perdiki
University of Basel, Basel, Switzerland
Isabelle Marthot-Santaniello
Julius-Maximilians-Universität Würzburg, Würzburg, Germany
Holger Essler
Department of Computer Science & Engineering, University of Ioannina, Ioannina, Greece
Georgios Vardakas & Aristidis Likas
Archimedes/Athena RC, Athens, Greece
John Pavlopoulos

Authors

John Pavlopoulos
Maria Konstantinidou
Elpida Perdiki
Isabelle Marthot-Santaniello
Holger Essler
Georgios Vardakas
Aristidis Likas

Contributions

JP wrote the first draft and undertook the experiments; MK co-authored and contributed with conceptualisation; IMS, HE, and EP co-authored and contributed as palaeography experts; GV assisted with the experiments and machine learning expertise; AL contributed with machine learning expertise.

Corresponding authors

Correspondence to John Pavlopoulos or Maria Konstantinidou.

Ethics declarations

Conflict of interest

‘Not applicable’.

Ethics approval and consent to participate

‘Not applicable’.

Consent for publication

‘Not applicable’.

Materials availability

‘Not applicable’.

Code availability

We publicly release all our code in our repository.

Additional information

Editors: Ana Carolina Lorena, Albert Bifet, Rita P. Ribeiro.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Data split per strategy

The number of instances per dataset subset (i.e., train, dev, test) and per split strategy is shown in Table 4. The strategy with the most training instances is modulo-13, followed by few-lines and the initial one (default).

Table 4 Number of instances per data subset, per split strategy


Appendix B Lines instead of patches

The PLL variant of our dataset was developed based on the text lines. This path follows the scribe’s natural writing style. It also follows the segmentation process of handwritten text recognition technology. An alternative path would be to crop multi-line rectangles. In preliminary experiments, this path deteriorated the results and was discarded. This could be due to two reasons. First, rectangles comprise anything from a few to many lines, introducing layout information. The latter (especially interlinear spacing) is taken into account in identifying fragments from the same scribe/book, but there is no evidence suggesting that it is a factor for dating purposes. It rather reflects the quality of the book (luxurious copies tend to have more generous spacing) and a wide range of layout formats is consistently present across all periods. Second, line segmentation homogenises the dataset by allowing only single lines to serve as input, while rectangles, especially without scaling, vary greatly in the quantity of text they contain, which may confuse the model.

Appendix C Explainability as a Method

Explainability as a Method (XaM) shifts focus from simply striving for accurate date predictions to a more nuanced examination of the underlying features that influence the date estimations as interpreted by the models. As discussed in Sect. 8.3, the dates in our data are primarily based on human estimates using palaeographic techniques. This process, inherently subjective, involves skilled interpretation of ancient scripts and handwriting styles, which may introduce a level of uncertainty in the data. Furthermore, the specific features that drove the experts to the date they assigned are not explicitly stated in published literature. Hence, any disagreement between experts (cf. Sect. 8.1) is hard to resolve or analyse further.

With XaM, we hypothesise that a model trained on the ground truth produced by the experts can be used to estimate dating-driving features. More specifically, we suggest producing manual script reproductions (i.e., the disegni) to assess the dependence of dating on specific features. We provided an example in Sect. 8.3.1, where we produced disegni by gradually ablating letter shading (unevenness in stroke thickness) of the line in Fig. 11a, showing that dating gradually moves from the 2nd to the 3rd c. CE. Figure 13 shows the saliency maps of these disegni, along with the original line. The explanations focus on ink concentrations and are more numerous on the disegni with full shading. A more thorough analysis of this direction is left for future work.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Pavlopoulos, J., Konstantinidou, M., Perdiki, E. et al. Explainable dating of greek papyri images. Mach Learn 113, 6765–6786 (2024). https://doi.org/10.1007/s10994-024-06589-w


Received: 29 February 2024
Revised: 09 June 2024
Accepted: 18 June 2024
Published: 11 July 2024
Issue Date: September 2024
DOI: https://doi.org/10.1007/s10994-024-06589-w
