1 Introduction

No autographs of classical Greek authors survive today. Our knowledge of such works (along with post-classical literature and the first Christian works, including the New Testament) relies on manuscripts postdating the original compositions. Of these, the most chronologically proximal are a few thousand papyri excavated mainly in Egypt in the last two centuries. Due to physical damage, these papyri usually preserve only small portions of the texts in question, unlike medieval manuscripts, which tend to transmit them in full. Both papyri and manuscripts, however, represent copies of copies of the original works.

1.1 Background

Despite their fragmentary nature, papyri are crucial witnesses for innumerable texts, not to mention that they occasionally preserve literary works that would be otherwise lost. They are also invaluable evidence for our understanding of book culture in Antiquity, as well as for philology, the evolution of writing scripts and book production. One of the most important aspects of such research is to determine the date of the papyri involved.

Unlike their documentary counterparts (i.e. papyri preserving official and everyday documents), literary papyri bear no date before the introduction of colophons in the Middle Ages (9th century CE). We customarily employ palaeographical methods to assign an approximate and broad date (often spanning more than a century) for their production. Apart from their content, the two categories, documentary and literary papyri, are also usually written in distinctly different scripts: informal cursive writing for the former as opposed to elegant bookhands for the latter. There are some exceptions on both sides, i.e. literary texts written in cursive and documentary texts with surprisingly elegant scripts. To this day, we lack an exhaustive list of the first category (literary texts in cursive script), which prevents us from using the numerous dated documents to date these literary papyri by script comparison. However, a substantial number of specimens of the second group (documentary texts written in bookhand) have been included in the Collaborative Database of Dateable Greek Bookhands (CDDGB) along with literary texts (see below). Palaeographers rely on the evidence-backed assumption that handwriting styles are typical of certain periods and change over time, much like fashions and trends in anything else. The subjectivity of these methods and their reliance on authority are increasingly acknowledged among scholars (Mazza, 2019; Choat, 2019; Nongbri, 2019, 2014), and any assistance towards more reliable and/or accurate methods is highly desirable.

In traditional dating, papyrologists employ comparative dating. They use the (admittedly very few) objectively dateable papyri specimens to draw comparisons with non-dated ones and estimate the latter's place on a notional timeline. The comparison is performed on the basis of the form and features of single letters, or of the script overall, which are also used for other palaeographical tasks such as identifying scribes or classifying styles and types of scripts. The characteristics used for such studies may focus on size (small/large, short/long), shape (round/angular), specific parts of letters (arches/loops/serifs/decorations), speed of writing, ductus (the number, directions and sequence of strokes required to draw a letter), formality, etc. Although the same features are regularly invoked by many palaeographers, each researcher is free to focus (and they often do so) on any conceivable aspect of the writing. Hence, there is no formally established methodology, set of features to be taken into account, or even terminology that has reached consensus (Stokes, 2009, 2017). Even for the commonly used and agreed upon features, it is rarely possible for scholars to measure them or objectively calculate their significance towards a conclusion. Research in digital palaeography that quantifies script features such as angle and direction of writing [for instance (Brink et al., 2012; Bulacu et al., 2003)] usually provides one such feature as the basis for performing palaeographical tasks computationally. In our study, we aim to perform such a task (in this case dating) without any input in the form of human-perceived features. Instead, we attempt to identify any clues or features that lead our models to a specific date for a papyrus image.

Carbon dating is a destructive method and difficult to perform on any single papyrus (due to high costs, material damage, etc.), let alone on a large scale. Precision is also at stake, because more than one date span can be suggested (Sect. 8.2). Furthermore, it is a technique that dates the papyrus, not the text written on it, which may predate the manuscript by many years. Palaeographical methods, on the other hand, require a wide range of potential comparanda (easily accessible for consultation) to date a papyrus, and they are subjective. In contrast, computational image dating can be performed at scale in a non-destructive manner. It can also serve as an auxiliary tool for the expert, reducing subjectivity in palaeographical dating by avoiding the ad hoc use of script features. The computer can also pinpoint areas of the images that push predictions towards either extreme, and/or alter these images (and predict the corresponding date) in a controlled manner. Nevertheless, it cannot provide explanations in real-life terms, nor identify features perceivable by humans. At the same time, human experts instinctively date scripts in terms of certain characteristics, however subjective, but are unable to measure the significance of each such feature towards assigning a date. In this preliminary examination, our aim is to detect patterns (not necessarily semantically clear at this stage) through the application of saliency maps.

1.2 The contributions of this work

C1.:

We developed a dataset of images of Greek papyri from Egypt, along with the dates assigned to them by experts, which we release in two variants: one with whole papyri fragments; the other with lines of writing extracted from the full-size images.

C2.:

We proposed a Convolutional Neural Network (CNN), which we call fCNN, that is grounded in a fragmentation-based augmentation strategy and which predicts the date of text-line images with a mean absolute error of 54 years, using a regression head, and a macro-average F1 of 61.5%, using a classification head, setting the state of the art for Greek papyri image dating.

C3.:

We used fCNN to refine the dating of lines from papyri whose previous dates are disputed or vaguely defined, and suggested using explainability as a method to explore dating-driving features. All our data, code, and predictions are publicly available: https://github.com/ipavlopoulos/palit.

2 Related work

Although researchers have suggested algorithms for the automated segmentation of papyri images into text lines (Papavassiliou et al., 2010), and although the benefits of text-line segmentation are already known in the field of writer identification (Cilia et al., 2021), no published work to date has investigated computationally dating Greek literary papyri by focusing on text-line images. The baseline is set by a CNN that is fed with entire Greek literary papyri images and which achieved a mean absolute error of more than a century (Paparrigopoulou et al., 2023). Our study shows that data segmentation into text lines leads to a much smaller error, with augmentation-enhanced CNNs providing the best-performing solution. In the absence of other related work on Greek papyri image dating, we summarise in the following section the published work regarding dating in general (Omayio et al., 2022).

2.1 Image-based regression

CNNs outperform approaches based on feature engineering in writer identification (Nguyen et al., 2019; Fiel & Sablatnig, 2015) and similar findings are reported in dating. In Hamid et al. (2019), the authors used pre-trained CNNs to date images of medieval Dutch charters from the 14th c. to the 16th c. CE, by focusing on image crops. The authors reported a mean absolute error of 10 years, a number beyond our reach with papyrus data, where an approximation of 50 years is accepted. Regression using pre-trained CNNs on random crops was also suggested in Wahlberg et al. (2016), for the dating of medieval Swedish charters. Besides feature extraction with deep learning, earlier work approached the task with regression on top of extracted features, such as scale-invariant features (He et al., 2016) or hinge and fraglets (He et al., 2014).

2.2 Dating from other modalities

Besides images, other modalities have also been used as input. In Kumar et al. (2012), for example, textual features were used to infer the date. Although reasonable in general, this is not a feasible approach for Greek literary papyri and manuscripts, the content of which (literary works) may have been composed several centuries before they were actually copied on papyri (for instance, a papyrus penned in the 4th century CE that contains Homer’s work, probably composed around the 8th c. BCE). A different approach was suggested in Rahiche et al. (2020), where ordinal classification was combined with multispectral imaging, tracking spectral responses of iron-gall ink (of historical letters, 17th–20th c. CE) at different wavelengths. Although rich, this data representation is very expensive in time and resources to establish, especially for fragile material, which also explains why datasets in this form are very rare. Besides, papyri are mostly written with carbon-based and not iron-gall ink, which remains more difficult to date.

3 Data

3.1 The nature of the papyri

As already mentioned, papyri bearing literary texts do not carry a date and, for the vast majority of them, papyrologists assign a date based on the affinity of their script with objectively (not palaeographically) dated specimens. These ‘objectively dated’ specimens are dated using external indications (not contained in the literary text on the papyrus) (Turner, 1987a). Occasionally, archaeological evidence or even radiocarbon dating suggests a more secure date. Most importantly, papyri were often re-used after they exceeded their lifespan, and literary texts are often found on papyri that have dated documents on the back side.

3.2 Digitised papyri

The images included in our dataset come from a number of collections and online resources, while five or six of them were scanned from images in printed volumes. Their digitisation took place over a period of more than two decades, under substantially different imaging protocols. As a result, they vary greatly in their properties, most importantly in scaling to actual size, colour capturing, resolution and bit depth. For a few of them, it was not possible to extract text lines, due to very low resolution, and they returned empty files during the segmentation stage.

3.3 Our new dataset

Our dataset comprises images of Greek papyri from Egypt and their respective dates, from the 1st to the 4th c. CE.Footnote 1 Images of papyri from other centuries were few, hence we did not consider them in this study. The papyri included were selected (i.e., excluding non-Greek, documentary, and ones with an unavailable image) from CDDGB, the only available collection of (somewhat) securely dated literary papyri, which includes also a few documentary texts in bookhand. The data it contains can be dated based on various objective dating criteria, such as the presence of a document that contains a date on the reverse side, internal evidence in the text (mostly for the few documentary ones and the 9th c. CE manuscripts having colophons), radiocarbon dating, or a dateable archaeological context associated with the manuscripts. In the CDDGB database, most records contain sampled images and we had to manually trace full-sized ones from the respective collections. We release our dataset in two forms, one where images contain whole fragments and one where they contain text lines.

Fig. 1 Number of PLF images per century CE, or per century range when the date ranges between centuries, sorted by frequency

The Papyri Literary Fragments (PLF) dataset initially consisted of 242 images of publicly available papyri fragments from the 1st c. BCE to the 9th c. CE. As can be seen in Fig. 1, most fragments come from the 2nd or 3rd c. CE, followed by the 9th and the 1st c. CE. When multiple fragments of the same manuscript were available, we included all of them. The date provided for most fragments is not specific. Typically, the minimum date range assigned to a literary papyrus spans 50 years, but it may reach up to two centuries. Most often, the latter cases concern a date between the two most frequent centuries (i.e., ‘2–3’ in Fig. 1). Our study focused on the first four centuries, from the 1st to the 4th c. CE (abbreviated below as 1 CE to 4 CE), comprising 168 images of literary papyri, reduced to 159 by removing nine empty images from 2 CE and 3 CE. The final distribution across the four centuries (1–4 CE) was 20, 61, 60, and 18, respectively. We converted our images to grayscale to reduce dimensionality and to facilitate machine learning experiments. A potential bias concerns dates derived from reuse (more than 40% of our data), as this method provides only one boundary of the interval whereas the other needs to be estimated.

The Papyri Literary Lines (PLL) dataset extends PLF so that images of the text lines of the fragments are provided instead of images of the whole fragments.Footnote 2 The 159 images were segmented automatically using the Transkribus HTR platform,Footnote 3 yielding 4,655 line images (i.e., approx. 30 lines per fragment). For this segmentation step, we used the default settings in Transkribus and did not train a specific baseline model, due to the multiformity of our material. We intervened minimally, by manually correcting text regions where none or very few lines were captured in the automatically generated segmentation. We also manually corrected a small number of baselines and line regions (approx. 1–2%), when no writing or an insignificant amount of writing was captured, or when substantial and useful writing areas were obviously excluded. Even so, a considerable number of possibly useful lines were not added and, in several cases, the automatic segmentation captured multiple lines in a single instance, or a substantial amount of background with minimal writing. Also, we did not eliminate lines with noise, such as damaged papyrus surface, gaps in the writing material (holes), and lines bordering the edge of the papyrus. As a result, several line images still contain noise, which means that the dataset would benefit from more interventional curation.

Fig. 2 Scatter plot of the width (shown horizontally) and height of the images measured in pixels

The balance of the centuries in PLL followed that of PLF, with 439, 2,116, 1,797, and 303 images from the 1st, 2nd, 3rd, and 4th c. CE, respectively. The scatter plot of PLL in Fig. 2b, shown alongside that of PLF in Fig. 2a for comparison, shows that most of the images have a height of 50 pixels or more, and a diverse width, perhaps explained by the fact that lines comprise texts of various lengths, from a single word to more than ten. We filtered out low-quality images, with a height lower than 50 pixels or a width lower than 300 pixels, which resulted in 2,774 images in total (a 40% reduction). We release the annotations of the images, along with the URLs of the images, and a script to download the data.Footnote 4
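The size filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the released download script: the `LineImage` record and the three sample entries are hypothetical, while the two thresholds and the image counts (4,655 before and 2,774 after filtering) are taken from the text.

```python
from typing import NamedTuple


class LineImage(NamedTuple):
    """Hypothetical record for one segmented line image."""
    name: str
    width: int   # pixels
    height: int  # pixels


MIN_HEIGHT = 50   # threshold stated in the text
MIN_WIDTH = 300   # threshold stated in the text


def keep_line(img: LineImage) -> bool:
    """True when the line image meets both size thresholds."""
    return img.height >= MIN_HEIGHT and img.width >= MIN_WIDTH


def reduction(before: int, after: int) -> float:
    """Fraction of images removed by the filter."""
    return (before - after) / before


# Hypothetical records, for illustration only.
lines = [
    LineImage("a", width=820, height=64),
    LineImage("b", width=250, height=70),   # too narrow
    LineImage("c", width=900, height=41),   # too short
]
kept = [img for img in lines if keep_line(img)]
print([img.name for img in kept])
print(f"{reduction(4655, 2774):.1%}")  # share of PLL lines removed
```

With the counts reported above, the removed share is about 40.4%, which agrees with the rounded 40% reduction.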

4 Method

Our method, called fCNN, is a 43M-parameter CNN that exploits augmentation so that it is robust to fragmented input, which is often met in papyri.

4.1 fCNN

The network consists of two Conv2D layers, of 32 and 64 channels respectively, which represent the image of each text line, followed by a 3-layer feed-forward neural network (FFNN). We used a convolutional kernel of size 5, single stride, zero padding, max-pooling (2x2), batch normalization, and ReLU activations. The FFNN receives a flat representation from the Conv2D layers, which it reduces to 1,024 and then to 512 neurons. The output layer comprises one neuron that yields the date in fCNNr, and four neurons followed by softmax in fCNNc. A ReLU activation function, batch normalization, and dropout of 0.1 are applied per layer.
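As a sanity check on the stated model size, the parameter count can be derived from the layer description with plain arithmetic. The sketch below rests on several assumptions that the text does not confirm: a single-channel 50x300 input (the minimum line size and the 1:6 aspect ratio mentioned elsewhere), "zero padding" read as a padding of 0, a 2x2 max-pooling after each convolution, and batch normalization with a learnable scale and shift after each hidden layer. The function names are illustrative.

```python
def conv_out(size: int, kernel: int = 5, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def fcnn_parameter_count(height: int = 50, width: int = 300, outputs: int = 1) -> int:
    """Count learnable parameters under the assumptions listed above."""
    channels = [1, 32, 64]  # grayscale input, then the two Conv2D layers
    total = 0
    h, w = height, width
    for c_in, c_out in zip(channels, channels[1:]):
        total += c_in * c_out * 5 * 5 + c_out       # conv weights and biases
        total += 2 * c_out                          # batch-norm scale and shift
        h, w = conv_out(h) // 2, conv_out(w) // 2   # 5x5 conv, then 2x2 max-pool
    sizes = [channels[-1] * h * w, 1024, 512]       # flattened features, then FFNN
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out + n_out               # linear weights and biases
        total += 2 * n_out                          # batch-norm scale and shift
    total += sizes[-1] * outputs + outputs          # output layer
    return total


print(f"fCNNr: {fcnn_parameter_count(outputs=1) / 1e6:.1f}M parameters")
print(f"fCNNc: {fcnn_parameter_count(outputs=4) / 1e6:.1f}M parameters")
```

Under these assumptions both variants come out at roughly 43 million parameters, almost all of them in the first fully connected layer, which is consistent with the size quoted for fCNN.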

4.1.1 Synthetic fragmentation

Synthetic fragmentation is a possible augmentation channel during training. Papyri are very often fragmented, leading to partial information in the image to be dated. We exploited this pattern as part of our augmentation strategy by erasing randomly (0.5 probability) image fragments, setting their pixel values to 0.5. Images were also transformed with Gaussian blur (kernel size of 3), random affine (up to 3 degrees), as well as cropped and resized (by keeping the 1:6 aspect ratio).Footnote 5
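The erasing step can be illustrated with a small pure-Python function operating on a grayscale image stored as a list of rows with values in [0, 1]. This is a sketch and not the released implementation: the single rectangular patch and the `max_frac` size limit are assumptions, whereas the 0.5 probability and the 0.5 fill value come from the text.

```python
import random


def erase_random_patch(image, rng, p=0.5, fill=0.5, max_frac=0.3):
    """Return a copy of `image` in which, with probability `p`, one random
    rectangle is overwritten with `fill`.

    `max_frac` (hypothetical) caps the patch size as a fraction of each side.
    """
    out = [row[:] for row in image]          # never modify the input in place
    if rng.random() >= p:
        return out                           # no erasing for this sample
    height, width = len(out), len(out[0])
    patch_h = rng.randint(1, max(1, int(height * max_frac)))
    patch_w = rng.randint(1, max(1, int(width * max_frac)))
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)
    for y in range(top, top + patch_h):
        for x in range(left, left + patch_w):
            out[y][x] = fill
    return out


rng = random.Random(0)
blank = [[0.0] * 60 for _ in range(10)]      # toy 10x60 image (1:6 aspect ratio)
augmented = erase_random_patch(blank, rng, p=1.0)
erased = sum(value == 0.5 for row in augmented for value in row)
print(f"{erased} of {10 * 60} pixels erased")
```

Filling with the mid-grey value 0.5 rather than with black or white keeps the synthetic gap from resembling either ink or papyrus background.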

4.2 The baseline

As baselines, we used ensemble regressors, which represent the state of the art in regression (Fernández-Delgado et al., 2019), namely Extremely Randomized Trees (XTR) and XGBoost (XGB). We experimented with both of these regressors, using 50×300 patches cropped from the center of each image, which were represented with PCA-extracted 500-dimensional features. In our preliminary experiments, PCA led to better results compared to image binarisation using Canny edge detection and Otsu, which have been reported to be beneficial in writer identification (Nasir & Siddiqi, 2021). We used the implementation provided by sklearn, setting all hyper-parameters to default values, except for the objective of XGB, which was set to the squared error.

5 Experiments

We experimented on the PLL dataset, using as input the text-line images and as output the dates of the respective papyri. Our experiments include casting the problem as a classification task, by predicting the century as one out of four labels.

5.1 Experimental details

We used Adam optimisation (Kingma & Ba, 2014) with a learning rate of 1e-3, batch size of 16, 200 epochs, and early stopping with patience of 20 epochs. We used 2,218 instances for training, 278 for testing, and 278 for validation. The regression variant was trained with a mean squared error loss and the classification variant with a cross entropy loss. We used PyTorch and we released our code in our repository.Footnote 6
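The early-stopping rule can be sketched independently of any deep learning framework. The function below is an illustration under assumptions, not the released training loop: it takes a precomputed list of per-epoch validation losses (hypothetical values here) and reports the best epoch and the epoch at which training would stop. The epoch limit, the patience, and the split sizes are taken from the text.

```python
TRAIN_SIZE, VAL_SIZE, TEST_SIZE = 2218, 278, 278   # split sizes stated in the text
MAX_EPOCHS, PATIENCE = 200, 20                     # training limits stated in the text


def run_early_stopping(val_losses, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Return (best_epoch, last_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without improvement.
    """
    best_loss, best_epoch, last_epoch = float("inf"), -1, -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch    # new best: reset the clock
        elif epoch - best_epoch >= patience:
            break                                  # patience exhausted
    return best_epoch, last_epoch


# Hypothetical loss curve: improves for five epochs, then plateaus.
losses = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.65] * 50
best_epoch, last_epoch = run_early_stopping(losses)
print(best_epoch, last_epoch)
```

In this toy curve the best epoch is the fifth one (index 4) and training halts 20 epochs later, well before the 200-epoch limit.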

5.2 The benchmark

A majority baseline (BLM), which always predicts 2 CE, achieved an MAE of approx. 0.632 and an MSE of 0.772. XTR and XGB perform better than this weak baseline, with a considerable difference when looking at MSE. The latter penalizes greater distances more, which means that the papyri of the 1st and 4th c. CE were better handled by XGB and XTR. Our fCNNr performs considerably better than all the baselines, achieving an average absolute error of 54 years.
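For reference, the two error measures and their standard error of the mean can be computed as follows. The labels below are hypothetical and serve only to show the computation; they do not reproduce the figures reported for BLM. Dates are assumed to be expressed in century units.

```python
from math import sqrt
from statistics import mean, stdev


def error_summary(y_true, y_pred):
    """Mean absolute error, mean squared error, and the standard error of each."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [e * e for e in abs_err]

    def sem(values):
        return stdev(values) / sqrt(len(values))   # sample std / sqrt(n)

    return {
        "MAE": mean(abs_err), "MAE_sem": sem(abs_err),
        "MSE": mean(sq_err), "MSE_sem": sem(sq_err),
    }


# Hypothetical test labels in centuries; the majority baseline always says 2.
y_true = [1, 2, 2, 3, 3, 4]
majority = [2] * len(y_true)
summary = error_summary(y_true, majority)
print({name: round(value, 3) for name, value in summary.items()})
```

Because squaring amplifies errors of two centuries relative to errors of one, a model that handles the edge centuries better improves MSE more visibly than MAE.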

Table 1 Mean absolute and squared error of dating along with their standard error of the mean in parenthesis

5.3 From regression to classification

By rounding the predictions of our fCNNr, we created a confusion matrix, which is shown in Fig. 3a. Confusion regards mainly neighboring centuries. The model correctly detects images from 2 CE and 3 CE while images from 3 CE may be predicted close to 2 CE, and vice versa. Difficulties in dating regard the two edges because 1 CE and 4 CE are more often predicted as 2 CE and 3 CE, respectively.
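The rounding step can be sketched as below. It assumes that the regressor outputs values in century units and that out-of-range predictions are clamped to the first and fourth century; the clamping and the sample predictions are assumptions made for illustration.

```python
from math import floor


def to_century(prediction, lowest=1, highest=4):
    """Round a continuous prediction to the nearest century and clamp it
    to the labelled range (the clamping is an assumption)."""
    rounded = floor(prediction + 0.5)            # round half up
    return min(highest, max(lowest, rounded))


def confusion_matrix(y_true, y_pred, labels=(1, 2, 3, 4)):
    """Rows are true centuries, columns are predicted centuries."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(y_true, y_pred):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


# Hypothetical true centuries and raw regression outputs.
y_true = [1, 2, 3, 4, 2]
raw = [1.6, 2.2, 2.4, 3.4, 1.9]
rounded = [to_century(value) for value in raw]
for row in confusion_matrix(y_true, rounded):
    print(row)
```

In this toy example the lines from the edge centuries drift towards the middle ones, the same pattern described above for 1 CE and 4 CE.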

Although our task at hand is by nature one of regression, we also trained and assessed a classification variant (fCNNc), which disregards the order of the centuries and simply treats them as labels. In Fig. 3b, we observe that results improve across all centuries except 4 CE, where the difficulty remains approximately the same. This observation can also be seen in Table 2, which shows the F1 per century per fCNN variant. In this table, we also observe that a trained XGB classifier is better than an XTR one, but both fall short of fCNNc.
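The per-century F1 and its macro average can be computed directly from true and predicted labels. The sketch below uses hypothetical labels and assigns an F1 of zero to a century that is neither predicted nor present, which is a convention chosen here for illustration.

```python
def f1_per_class(y_true, y_pred, labels=(1, 2, 3, 4)):
    """F1 score for each century label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


# Hypothetical labels, for illustration only.
y_true = [1, 2, 2, 3, 3, 4]
y_pred = [2, 2, 2, 3, 2, 4]
scores = f1_per_class(y_true, y_pred)
print({label: round(score, 3) for label, score in scores.items()})
print(round(macro_f1(scores), 3))
```

Macro averaging gives the sparsely represented 1st and 4th centuries the same weight as the two frequent ones, so weak performance at the edges is not masked by the majority classes.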

Fig. 3 Confusion matrices of fCNNr (rounded predictions) and fCNNc

Table 2 F1 per century of fCNNr (rounded predictions), fCNNc (along with the absolute difference compared to fCNNr), XGB, and XTR classifier

Although both fCNN variants are trained on the same data, we do not consider them as competitors. The regression-based fCNNr suggests a date, which can provide a very rough estimation of when the papyrus was written. If the predicted date was 280 CE, then this is an indication that the papyrus is dated between 3 CE and 4 CE, and that a year close to the latter is likelier. On the other hand, the classification-based fCNNc suggests a century and yields a score to indicate its confidence. If the predicted century was the 4th c. CE and the confidence was 80%, then this means that the network is confident that the date is the 4th c. CE and no other. Although our task at hand is one of regression, both can generate useful explanations. Therefore, since our end goal is to assist and not supplant the expert, we used them both in our explainability study, discussed next.

5.4 Explainability

Saliency maps (Simonyan et al., 2013) reveal the parts of the image that are responsible for the network’s prediction. We experimented with both variants, fCNNr and fCNNc, and we used both gradient- and perturbation-based attribution. In this study, we opted for fCNNc using gradient-based attribution, but we observe that explanations by the two variants can be combined to yield richer explanations.
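To make the idea of perturbation-based attribution concrete, the sketch below implements a simple occlusion scheme: each window of the image is replaced with a neutral value and the resulting drop in the model's score is recorded as that window's importance. It is an illustration only. The `ink_fraction` scoring function is a hypothetical stand-in for the network's confidence, and the window size and fill value are assumptions; the attribution method actually used in this study may differ.

```python
def occlusion_saliency(image, score_fn, window=2, fill=0.5):
    """Heatmap whose value at each pixel is the score drop observed when the
    window containing that pixel is overwritten with `fill`."""
    height, width = len(image), len(image[0])
    base = score_fn(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            rows = range(top, min(top + window, height))
            cols = range(left, min(left + window, width))
            occluded = [row[:] for row in image]
            for y in rows:
                for x in cols:
                    occluded[y][x] = fill
            drop = base - score_fn(occluded)     # positive: the window mattered
            for y in rows:
                for x in cols:
                    heat[y][x] = drop
    return heat


def ink_fraction(image):
    """Hypothetical scorer: share of dark (ink-like) pixels in the image."""
    pixels = [value for row in image for value in row]
    return sum(value < 0.25 for value in pixels) / len(pixels)


# Toy 4x4 image: white background (1.0) with two ink pixels (0.0) top-left.
toy = [[1.0] * 4 for _ in range(4)]
toy[0][0] = toy[0][1] = 0.0
heat = occlusion_saliency(toy, ink_fraction)
for row in heat:
    print(row)
```

Only the window that covers the two ink pixels receives a non-zero value, which mirrors on a toy scale the behaviour reported for the network: attention on writing, none on background.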

Fig. 4 Saliency maps for lines of papyri per century

We computed one heatmap per predicted line and we present a random sample of lines in Fig. 4. The heated colours show that the network consistently focuses on the letters in order to yield its predictions for the date. This suggests that the model bases its prediction on the shape of specific letters, the distance between them, their size, or the intensity of the ink. By contrast, it seems invariant to background noise and other attributes that are often present in Greek literary papyri. For example, gaps (holes in the papyrus), such as those in Figs. 4a and c, do not get any attention from the model.

6 Assessing data sources limitations

6.1 Inaccuracies in dating

CDDGB is not a product of targeted research on securely dated papyri, but rather a compilation of such examples mentioned in other papyrological works.Footnote 7 Hence, the collection is not comprehensive and the data included has not been meticulously assessed by the compilers. Shortcomings concern the accuracy of some dates. Nevertheless, it is the same data of objectively dated papyri that papyrologists use as a reference for palaeographical dating. In this study, we introduce the computational factor in assessing scripts in connection with their assigned dates. Also, since we focus on the explainability of dating images of handwritten text, we do not consider these shortcomings detrimental. The possible inaccuracies in dating and the wide range of the assigned dates do not affect the explanations, which aim to provide pointers on features of the script.

6.2 Class imbalance

The imbalance in the size of the fragments and quantity of lines is an inherent issue owing to the nature of the available material. A papyrus may contain three or four usable lines, whereas others may have more than fifty. This does not affect dating significantly because, although test lines may come from a manuscript not hidden during training, each line constitutes a completely different image pattern. The same issue could be an advantage regarding explainability because possible features are brought out in a more controlled manner when multiple lines of the same manuscripts are involved. While some features, especially palaeographically insignificant characteristics, remain consistent (such as colour/intensity of the ink, texture and colour of the background, general size of script, scale, etc.), explanations can focus on pivotal ones.

6.3 Fragment leakage

Our train and validation subsets are mutually exclusive at the line level but not necessarily at the fragment level. Although the former is straightforward, the latter is not, due to the diversity of lines in the fragments. Figure 5 shows this diversity, as well as the fact that most fragments comprise only a few lines. In other words, fCNN may have learned to detect the fragment instead of the century of the line. Although this would potentially be an interesting finding, it is beyond the scope of this study. Hence, we experimented with two different dataset split strategies, in order to reject this hypothesis (the number of lines per dataset per split is shown in Table 4 of Appendix A).

Fig. 5 Number of lines per fragment (shown horizontally, sorted based on the number of lines). The distribution is exponential and most fragments comprise only a few lines

6.3.1 The Modulo-13 split

We held out the lines from papyri whose index is divisible by 13 (i.e., leaves a zero remainder), splitting them in half between validation and testing, and kept the rest for training. This strategy introduces a slight distribution drift, with relatively fewer lines from 1 CE during testing and more from 4 CE.
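A fragment-level split of this kind can be sketched as follows. Several details are assumptions, since the text does not specify them: fragments are indexed from zero, the held-out fragments are alternated between validation and test (so that the halving happens at the fragment level), and each line is a hypothetical `(fragment_index, line_id)` pair.

```python
def modulo_split(lines, modulus=13):
    """Split (fragment_index, line_id) pairs so that no fragment is shared
    between the training data and the held-out data."""
    train, held_out = [], {}
    for fragment_index, line_id in lines:
        if fragment_index % modulus == 0:
            held_out.setdefault(fragment_index, []).append((fragment_index, line_id))
        else:
            train.append((fragment_index, line_id))
    validation, test = [], []
    # Assumption: alternate held-out fragments between validation and test.
    for position, fragment_index in enumerate(sorted(held_out)):
        target = validation if position % 2 == 0 else test
        target.extend(held_out[fragment_index])
    return train, validation, test


# Hypothetical data: 40 fragments with two lines each.
lines = [(fragment, line) for fragment in range(40) for line in range(2)]
train, validation, test = modulo_split(lines)
print(len(train), len(validation), len(test))
print(sorted({f for f, _ in validation}), sorted({f for f, _ in test}))
```

Because membership is decided per fragment rather than per line, a model evaluated on such a split cannot benefit from having seen other lines of the same fragment during training.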

6.3.2 The few-lines split

We sorted the fragments based on the number of lines of each fragment. Then, we used the lines from the ones with the fewest lines (the rightmost ones in Fig. 5) to form the test set. We followed a stratified sampling based on the distribution of the centuries in the training data. For validation, we used fragments with few lines, one per century, combined with instances sampled from the training data to avoid a distribution drift at the century level and to assist the learning process.

6.3.3 Empirical analysis

The F1 scores per century of fCNNc, when it was re-trained and assessed according to the two new split strategies, are shown in Table 3. Both strategies have merits compared to the default, with modulo-13 being better than both few-lines and the default strategy. Based on these results, we reject the hypothesis that fragment leakage could drive the promising results presented in this study.

Table 3 F1 per century of fCNNc per split

7 Error analysis

To go further in our understanding of the relevance of our experiment, we provide in this section an error analysis, followed by an experiment to understand the way the model handles damaged papyri by ablating input images before dating.

7.1 Analysis

By studying fCNN’s deviations from the ground truth, we observed that these mostly concerned predictions of a neighbouring century. Images from the two edge centuries, 1 CE and 4 CE, are scored up to 2 CE and 3 CE, respectively, the two most frequent centuries (Fig. 1). Images from 2 CE and 3 CE, on the other hand, were scored not far from each other, most often to 3 CE and 2 CE, respectively. By looking at the saliency maps of the misclassifications, we observed that letter-shaped noise, present in the source images, received the model’s focus.

7.2 Ablation

Our error analysis revealed that fragments may deceive our model. In order to investigate the model’s sensitivity, we fed fCNNc with test images augmented with randomly-shaped black and white patches. We observe that the model’s focus changes according to the colour of the patch. White boxes appear to be disregarded by our model, in contrast to black boxes, which receive attention. An example is shown in Fig. 6, where the same line from a papyrus of the 3rd c. CE is altered in two ways. In Fig. 6b, the focus is everywhere except on the white patch. This is in line with our findings about breaks, which are also depicted in white in the papyri images (Fig. 4). By contrast, the black patch of Fig. 6a affects the prediction as if the model were guessing a missing character and as if the black colour of the patch were ink.

Fig. 6 Saliency maps of the same test line, from a papyrus of the 3rd c. CE, whose source image was transformed either with a black or a white patch before dating

7.2.1 White patches

The addition of white patches to the images harms the performance of fCNNc. The F1 dropped for each class, yielding a drop of 11 points in macro-averaged F1 overall (from 0.58 to 0.47).Footnote 8 By hiding parts of the image, important information is lost. In other words, fragments hinder the model in its prediction task.

7.2.2 Black patches

The performance dropped significantly more when the patch was black (from 0.58 to 0.22 in macro-F1, using the same model). Although an equal portion of information was lost from the image, compared to the white patches, in this case, the model attempted to fill the black gap, guessing the missing context. This was shown in Fig. 6a but can be more clearly observed in the examples of Fig. 7, where each black patch is annotated with a red plate. Outside the patches, high temperature appears on pixels with ink and handwriting. Inside the patch, however, letter-looking structures are revealed in black (with low temperature) on a high-temperature background, as if the network focused on the letter’s contour and the space between the (guessed) letters. This guessing pushed fCNN to predict more often the 3rd c. CE, as if the reconstructed part of the image (the black patch) is common in papyri of that century, steering accordingly the prediction.Footnote 9

Fig. 7 Saliency maps of black patches (annotated with red rectangles) on test images. A random sample per century is shown, all dated by fCNNc as 3rd c. CE

8 Dates in doubt: A computational estimate

fCNN can accurately predict the date of a text line image (Table 1) and, when the task is simply to predict the century and not an exact date, a classification variant that ignores the temporal relation of the labels yields even better results (Table 2). As was shown from our study of saliency-based explanations, fCNN focuses on the letters, that is the foreground and not the background (e.g., the blank parts of the papyrus sheet, the fibres, the holes and damages). In order to provide the experts with suggestions that could possibly improve the current dating,Footnote 10 we apply this network to loosely dated texts (across two centuries).

In our primary source, 11 papyri are dated either to 2 CE or 3 CE. Using fCNNc, we found that 87% of all the lines in these papyri are classified as 2 CE or 3 CE. Exceptions are 16 lines classified as the 1st c. CE and one that was classified as 4 CE. Figure 8 presents the analytical results.

Fig. 8 Chronological attribution of fCNNc of lines in fragments dated between 2 and 3 CE

Using fCNNr, we then attempted to estimate a more precise chronology for the lines in these papyri. Despite the fact that our regressor was trained on ground truth at the century level, our expectation is that it will have learned to yield a chronology that is closer to the objective date. Figure 9 presents our predictions, organised per papyrus. The predicted dates for the lines of P.Oxy. 3005, classified by fCNNc as 3 CE, are diverse, with the majority falling in the late 2nd and early 3rd c. CE. Overall, our network’s estimations agree with the range provided by the experts. The earliest prediction was 98 CE, for a line in P.Oxy. 661. This papyrus comprises parts of a poem by Callimachus and is dated from 150 to 250 CE,Footnote 11 with the first editor arguing for the late 2nd c. CE.Footnote 12 On average, our predictions suggest 200 CE, but some lines are predicted as early as 100 CE and others as late as 250 CE. The latest prediction is 270 CE, for a line in P.Flor. II 120,Footnote 13 dated from 250 CE to 261 CE. In this papyrus, our predictions agree with the experts in very few lines, because on average our network dates it before 200 CE. In P.Oxy. 4560, only one line is used, and the predicted date is 100 CE. In P.Oxy. 232, although lines are few, all our predictions date the papyrus between 100 CE and 150 CE.

Fig. 9 Chronological attribution of fCNNr of lines in fragments dated between 2 and 3 CE

8.1 P.Oxy. 852: taking a stand

The first editors dated fragments of Euripides’ Hypsipyle (P.Oxy 852) between the late 2nd and early 3rd c. CE.Footnote 14 The literary text is written on the back of a reused scroll, containing an account that may be dated to 106 CE.Footnote 15 How long papyrus scrolls were preserved before being reused is subject to discussion and examples range from a couple of years to more than a century (Turner, 1978b). Our fCNN models agree with the editors, favouring late 2nd instead of early 3rd c. CE.

8.2 P. Bodmer XXIV: radiocarbon versus palaeographic grounds

Radiocarbon dating of P. Bodmer XXIV yielded calendar date ranges of 231-261 CE or 277-339 CE. These ranges are consistent with the palaeographic estimates of Kasser/Testuz & Turner, who proposed the 3rd or 4th c. CE (Stevens, 2023), but cast doubt on the 2nd c. date assigned by Roberts.

Fig. 10 Histogram of the predictions of fCNNr for the images of lines of P. Bodmer XXIV. The predicted year is shown horizontally and the number of lines is shown vertically

We segmented each image of the papyrus into lines, 170 in total.Footnote 16 We discarded 25 lines because of their small size,Footnote 17 and fed the resulting 145 images to our fCNNc and fCNNr. The former predicted 121 of the 145 lines as 2nd c. CE, 13 as 1st and 11 as 3rd. The predictions of fCNNr, shown in Fig. 10, agree with this outcome. Hence, our fCNN agrees with both Roberts (more frequently) and Kasser/Testuz & Turner, probably picking up various characteristics of the script and the script variations that evolved through the centuries before becoming the norm. By leveraging explainability as a method, we study these characteristics further below.
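To make the line-level tally concrete, the counts reported above can be turned into a papyrus-level summary as in the sketch below. The helper `summarise_line_votes` is a hypothetical name, and reading the most frequent century as the papyrus-level suggestion is an assumption made for illustration, not a procedure stated in the text.

```python
from collections import Counter

def summarise_line_votes(line_centuries):
    """Return (most frequent century, its share of lines, per-century counts)."""
    counts = Counter(line_centuries)
    majority, n = counts.most_common(1)[0]
    return majority, n / len(line_centuries), counts

# Per-line fCNNc outputs for P. Bodmer XXIV as reported in the text:
# 13 lines as 1st c. CE, 121 as 2nd c. CE, 11 as 3rd c. CE (145 lines).
line_centuries = [1] * 13 + [2] * 121 + [3] * 11
majority, share, counts = summarise_line_votes(line_centuries)
print(majority, round(share, 3), dict(counts))
```

Keeping the full count per century, rather than only the winner, preserves the minority votes (here the 1st and 3rd c. CE lines) that are relevant when expert opinions diverge.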

8.3 XaM: explainability as a method

A significant portion of our training data is labelled for date relying to some degree on human estimations derived from palaeography. It is likely that the resulting explanations reflect the features that mostly contributed to date predictions, as perceived by both the model and the humans involved. Towards this goal, an accurate date prediction becomes less relevant and the models are repurposed, setting the lens to focus on the features that drive the predictions instead of the predictions themselves. We call this approach explainability as a method (XaM) and discuss it further below.

8.3.1 Ablating the input: disegni

Disegni are manual script reproductions established by drawing the papyrus letters on tracing paper with various writing implements, also achieving optimal binarization. Figure 11b shows a disegno we produced for P.Bodmer XXIV, originally shown in Fig. 11a, with ink splotches naturally being formed where the pen rests for a while. The disegno of Fig. 11c attempts to reduce these splotches by not resting the pen, while that of Fig. 11d further eliminates areas of high ink concentration by avoiding “letter shading”, i.e., variation in thickness or darkness of strokes within handwritten letters, due to differences in pressure applied by the writer’s hand, the angle of the writing implement, or the type of writing instrument used. Although we focused on P.Bodmer XXIV (Sect. 8.2) here, we note that our XaM-based approach is applicable to other use cases, scripts, and languages given data availability (Dhali et al., 2020; Boudraa et al., 2024).

Fig. 11 P.Bodmer XXIV (Rahlfs 2110): MOTB.MS.0001709, f.18r l.18 (Original image: courtesy Museum of the Bible Collection. All rights reserved. ©Museum of the Bible, 2024.). From the top: a the original image; b disegno with full shading; c disegno with most shading eliminated; d disegno made with a ballpoint pen (no shading)

Fig. 12 Century probabilities of fCNNc while dating the image and the disegni presented in Fig. 11

8.3.2 Driving features of date predictions

Figure 12 shows the century probabilities of fCNNc (using softmax) for the original image and the disegni shown in Fig. 11. The prediction is 2nd c. CE for both the original image and the disegno with full shading. However, as shading is reduced, the prediction of the network gradually moves from 2nd c. CE to 3rd c. CE, a result that agrees with the radiocarbon dating results of P.Bodmer XXIV (Sect. 8.2). The saliency maps of fCNNc (shown in Fig. 13 of Appendix C) show that explanations focus on greater concentrations of ink, with a strong preference for the intersections of strokes, as well as for the curvature of some others. These initial findings indicate that experimenting with controlled alterations of line images could produce more interpretable explanations and reveal script features that are decisive for dating. We discuss XaM further in Appendix C.
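For readers unfamiliar with how century probabilities are obtained from a classification head, the softmax step can be sketched as below. The logits and the set of four century classes are hypothetical and chosen only for illustration; they are not outputs of fCNNc.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes 1st, 2nd, 3rd and 4th c. CE.
centuries = [1, 2, 3, 4]
logits = [0.2, 2.1, 1.7, -0.5]

probs = softmax(logits)
predicted = centuries[probs.index(max(probs))]
for c, p in zip(centuries, probs):
    print(f"century {c} CE: {p:.3f}")
print("predicted century:", predicted)
```

Comparing the full probability vector across the original image and each disegno, rather than only the top class, is what makes a gradual shift between neighbouring centuries visible.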

9 Conclusions

This work introduced a dataset of images of Greek literary papyri with whole papyri fragments (PLF) and with lines of writing (PLL). Our experiments showed that an augmentation-enhanced CNN predicts the date of text-line images with a mean absolute error of 54 years, using a regression head, and a macro-average F1 of 61.5%, using a classification head, setting the state of the art for Greek papyri image dating. An explainability study revealed that fCNN clearly focuses on letters to predict the date, following the palaeographer’s path. Also, by masking the images prior to dating, we showed that black patches (as in ink) affected the model’s performance considerably more than white patches (as in writing surface) and that fCNN explanations on these black patches appear as letter-like structures. Finally, we applied fCNN to two papyri whose dates have been debated among palaeographers, taking a stand and showing that the features driving the dating can be revealed by leveraging explainability as a method. To this end, we used manually altered versions of the images in question, yielding improved dating predictions and more interpretable explanations.
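The two evaluation measures mentioned above, mean absolute error for the regression head and macro-averaged F1 for the classification head, can be computed as in the following sketch. The toy labels and predictions are hypothetical and unrelated to the reported results; the function names are ours.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, e.g. in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: years for regression, centuries for classification.
mae = mean_absolute_error([150, 250, 350], [200, 230, 320])
f1 = macro_f1([2, 2, 3, 3], [2, 3, 3, 3])
print(round(mae, 2), round(f1, 3))
```

Macro averaging gives each century equal weight regardless of how many lines it contains, which is why it is a common choice when classes are unevenly represented.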