Introduction

Archaeology, broadly defined, is the study of the human past through material remains: artefacts of various materials (e.g., stone, bone, pottery, metal, glass) that were manufactured, used, and discarded by ancient societies (Murray and Evans, 2008; Renfrew and Bahn, 2013). The first and most basic task of the field’s practitioners is to properly classify the numerous artefacts they encounter, determining their date, cultural attribution, form, function, socio-economic significance, and other features (Arkadiev, 2020; Dunnell, 1993; Hermon et al., 2004; Krieger, 1944; Whittaker et al., 1998). Such classifications often depend on prior knowledge, expertise, and preference for certain visual criteria over others (Barcelo, 1995).

In order to automate this process and utilise computers’ excellent pattern recognition capabilities, efforts have been made to incorporate computer applications into the processes of archaeological classifications (Derech et al., 2021; Tal, 2014). Notable among these are experimentations with machine learning models—computer algorithms that learn from data how to automatically detect patterns and make accurate decisions (Mitchell, 1997; Bishop, 2006; Duda and Hart, 1973). Several attempts were made to apply machine learning to archaeological materials (Barcelo, 2008, 2016; Barceló and Bogdanovic, 2015; Díez-Pastor et al., 2018; Macleod, 2018). However, at first, they relied on hand-crafted feature extraction, resulting in relatively poor performance measures (e.g., Boon et al., 2009). More recently, machine learning algorithms have been used to extract relevant features automatically. Thus, for instance, Agam et al. (2020) combined Raman spectroscopy with machine learning algorithms to quantitatively estimate different degrees of thermal alteration on flint artefacts.

Of particular interest is deep learning, and more specifically Deep Convolutional Neural Networks (CNNs), which are commonly used to analyse images. CNNs have been successfully applied to various computer vision tasks, as they can automatically extract features from input images (Cifuentes-Alcobendas and Domínguez-Rodrigo, 2019; He et al., 2016; Krizhevsky et al., 2017; Taigman et al., 2014). These features, also known as embeddings, are a set of numbers (1536 in this case) that are later used by other computational layers to classify the input data or infer other useful information from it. The features do not necessarily correspond to a directly interpretable measure of the data, such as colour or shape. Applied to archaeological problems, CNNs have shown promise, successfully fulfilling tasks of ceramic classification (Itkin et al., 2019), periodic discrimination of lithic assemblages (Grove and Blinkhorn, 2020), and differentiation of bone surface modifications (Domínguez-Rodrigo et al., 2020). However, these experiments with CNNs focused on narrow ranges of materials and contexts, and consequently did not confront the full diversity of the archaeological record.

Thus, in this paper, we seek to develop a CNN model able to navigate the full gamut of temporal and cultural diversity archaeology has to offer (Fig. 1). To do so, a large, publicly accessible repository of artefact photographs managed and maintained by the Israel Antiquities Authority (http://www.antiquities.org.il/t/default_en.aspx) was used. It presents archaeological items that span a million and a half years of Levantine hominin history. The base CNN was initially trained to classify everyday objects on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset (Russakovsky et al., 2015), which is a large dataset of natural images. Then, following some modifications to the CNN, the model was trained to identify archaeological artefacts according to period and site (Fig. 1a, b). Next, drawing on the model’s acquired capacity to correctly classify artefacts, it was determined whether the model can be effectively used to detect communities (Fig. 1c), that is, cohorts of classes with a meaningful common denominator. Finally, a case-study is offered on communities found among Natufian culture (ca. 15,000–11,700 years ago) classes in the Levant, showing that this method found archaeologically meaningful similarities between different sites.

Fig. 1: A schematic representation of the machine learning-based workflow.

A The training phase: a dataset of images of archaeological artefacts was grouped according to period and site, pre-processed, and used to train a Convolutional Neural Network (CNN). B The testing phase: the trained CNN was used to extract features from a query image and predict its class by identifying k-nearest neighbours in the training set. C Community detection: validation set predictions are aggregated in a confusion matrix that is later transformed into a weighted graph and fed to a community detection algorithm.

In this manner, a CNN was applied to this diverse archaeological dataset. First, it was assessed whether the model could predict an artefact’s site and period from its image. Second, the possibility of finding other similar objects for a query image was investigated. Third, because the model made “correct confusion” (e.g., confusion between two different sites that are dated to similar archaeological periods), the possibility of finding similarities between a few sites was examined, which can potentially open up new avenues for analysis and research on cultural interactions.

Below, an account of the archaeological dataset, the procedures, and estimates of the model’s performance is provided. Towards the end of the paper, a detailed account of the methods employed in various parts of the workflow is presented. Additional experiments, information, and technical details are provided in the supplementary material.

Dataset

The dataset is publicly accessible on the Israel Antiquities Authority (IAA) website (http://www.antiquities.org.il/t/default_en.aspx). It comprises 12,364 photographs of 6770 artefacts that derive from across the southern Levant and span the period from the Lower Palaeolithic (1.4 million years ago; Bar-Yosef and Goren-Inbar, 1993) to the Late Islamic period (fourteenth century AD). They include stone tools (e.g., blades, flakes, bifaces), bone tools (e.g., awls, beads, and pendants), metal objects (e.g., spearheads, coins), pottery vessels, and figurative art. Most of the artefacts presented are complete, and every item is designated according to its site and period of origin. The attribution of periods was provided by archaeologists working for or within the IAA and is available on its website.

Artefact categorisation by site and period produced a total of 555 classes (e.g., Early Bronze II Jericho, Iron II Akhziv) of various sizes. While some classes contained hundreds of artefacts, others comprised merely two or three (Fig. S1). In order to maintain a balanced dataset and provide sufficient conditions for statistical manipulations, the dataset was narrowed to the 200 largest classes, encompassing a total of 9909 images of 5450 artefacts, constituting 80.1% of the photographs and 80.5% of the artefacts (Table S1). Next, the dataset was split in two: one part for training, comprising 8031 images (81%) of 4428 artefacts, and another for validation, comprising 1878 images (19%) of 1020 artefacts.
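The narrowing step described above can be illustrated with a minimal Python sketch. It assumes class size is measured by counting whatever items are passed in; whether the original selection counted images or artefacts is not stated here. The function name and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def keep_largest_classes(labels, n_classes):
    """Return the set of the n_classes most frequent labels."""
    counts = Counter(labels)
    return {label for label, _ in counts.most_common(n_classes)}

# Hypothetical toy labels in "period_site" form (illustration only).
labels = (["Iron II_Akhziv"] * 4
          + ["Early Bronze II_Jericho"] * 3
          + ["Roman_En Gedi"] * 1)

kept = keep_largest_classes(labels, n_classes=2)
# Keep only the items whose class survived the cut.
filtered = [label for label in labels if label in kept]
```

In the setting described above, the same call would be made with n_classes=200 on the full list of period-site labels.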

Standard image classification relies on visual similarities (dogs, cats, cars, or faces of different identities). However, in this case, similar artefacts may belong to different classes (Fig. 2a), and visually distinct artefacts may belong to the same class (Fig. 2b). Furthermore, note that temporally adjacent periods are likely to incorporate visually similar artefacts (e.g., Early Roman and Roman amphorae). Therefore, to test this model, two levels of temporal discrimination were established: rough- and fine-period groups. The fine temporal classification consisted of 21 groups, while the rough classification observed 13 (Table S1).

Fig. 2: Classification and CNN model performance.

A Nearest neighbour pairs of artefacts from different classes (the image on the left derives from the validation set, and the image on the right derives from the training set). B Pairs of distinct images that derive from the same class. C Validation set query images (left column) and the top-3 training set nearest neighbours. D A histogram of model performance for fine-period prediction on 63 randomly picked images (3 images per period); the straight horizontal line marks the average prediction accuracy (69.84%). E A histogram of two archaeologists' performances (blind experiments) for the same 63 artefact images used for D; the horizontal lines mark the average prediction accuracy of each archaeologist (44.44% and 20.63%).

In order to facilitate the training process, the images’ background and scale were standardised. All photographs were given a homogeneous white background, and the scale was removed (see Methods section for more details).

Model construction

In order to optimise the CNN to the task of archaeological classification, the standard transfer learning procedure was followed. Transfer learning is usually used when the available database size for the target application is relatively small. In this case, in order to improve performance, a pre-trained CNN on another larger (unrelated) database is used as the starting point for the training process.

The CNN model was based on the ImageNet (Russakovsky et al., 2015) pre-trained image classification model EfficientNetB3 (Tan and Le, 2019), which was chosen for its superior performance (see Methods below). It was built by stacking many (hence deep) basic computation layers (convolutions, non-linearities, pooling, skip connections, etc.), striving to achieve the best balance between computational complexity and prediction accuracy. The model was pre-trained on the ImageNet ILSVRC dataset to predict an image’s category (class) out of 1000 possibilities, and reached 81.6% Top-1 and 95.7% Top-5 prediction accuracy (the Top-k classification score measures how often the correct label is among the top k predicted labels). More details on this model can be found in Tan and Le (2019).
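To make the Top-k score concrete, the following Python sketch computes it as a fraction of samples. It is an illustration under simple assumptions (per-class scores given as plain lists, labels given as integer indices) and is not the evaluation code used for the reported figures; all names are hypothetical.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring labels.

    scores: list of per-sample score lists (one score per class index).
    true_labels: list of integer class indices, one per sample.
    """
    hits = 0
    for sample_scores, true_label in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0, second highest: class 1
]
true_labels = [1, 1]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
```

Here the second sample is a miss at k = 1 but a hit at k = 2, so the score can only stay equal or rise as k grows.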

To perform the transfer learning, the original classification layers were removed and a customised classification layer was added (a fully connected layer that transformed the EfficientNetB3 embeddings, of size 1536, to 200 classes). To optimise the training, five models were trained with the same ImageNet initialisation, each generating a different feature vector, which were then combined to produce a final feature vector. To improve robustness and enrich the database, standard data augmentation techniques were applied. These include random rotations, spatial shifts, zoom, and horizontal flips. All of the CNN’s layers were trained for 25 epochs (in each epoch, the model is trained on the entire training set), using the categorical cross-entropy loss function (the most common loss function for classification tasks). Additional details can be found in the Methods section below.

Results

In the testing phase (Fig. 1b), the CNN was used as an “archaeological” feature extractor, and the archaeological dissimilarity between artefacts was measured by calculating the cosine similarity distance (see Methods below) between their feature vectors. Predictions were made by looking at the query image’s nearest neighbours in the labelled training set. This procedure is illustrated in Fig. 2c, which presents the three nearest neighbours for five query images. Three sorts of outcomes are notable: (1) a complete match between the query image and the top three nearest neighbours (Early Roman Caesarea, Lower Palaeolithic Tabun, and Byzantine En Gedi), (2) a proximal match between query and prediction, pertaining to site or period but not both (Lower Palaeolithic Ubeidiya), and (3) mixed results where some of the neighbours are a full match and others are proximal (Crusader Atlit) (see also Fig. S5).

The procedure above was used to measure accuracy on the validation set and can be used to classify other artefacts in the future (there was no separate test set in the setup). Since each item in the dataset had several labels (period, site, period-site, and rough- and fine-period group), accuracy on each of these is reported, regardless of the training objective.

Table 1 shows the model’s prediction accuracy for all possible labels on the validation set (i.e., period-site, site, period, fine-, and rough-period accuracy grades). Accuracy values in this table were obtained after training with the standard period-site classification objective. Specifically, prediction accuracy was 58.10% (Top-1) and 67.36% (Top-5) for period-site classes, and 76.36% (Top-1) and 85.41% (Top-5) for rough-period groups. Model accuracy for each fine- and rough-period group can be found in Fig. S2. The resulting confusion matrix and the t-SNE visualisation of the embeddings (Van der Maaten and Hinton, 2008) can be found in Fig. S3 and Fig. S4, respectively.

Table 1 Prediction accuracy [%] for period-site, site, period, fine-, and rough-period grouping.

Another evaluation strategy entailed pitting the trained model against two archaeologists. Sixty-three query images of different artefacts were selected, three for each of the twenty-one fine-period groups. These images were then presented to the two archaeologists and to the model, which were asked to assign each image its appropriate temporal designation. The results indicate that the model performed as well as these two archaeologists within their fields of expertise, and had a higher average accuracy level when considering all possible periods. Thus, the model achieved an average accuracy score of 69.84% (Fig. 2d), while the archaeologists scored 44.44% and 20.63% (Fig. 2e).

Having attained these results, the best classification choice for the archaeological dataset was determined. To do so, an experiment was devised that entailed training the model with several classification objectives (only sites, only periods, or a combination of sites and periods) and comparing their performance (see the additional classification experiments section in the supplementary material). It was found that (1) when trained on period-site classes, the model achieved the highest accuracy levels for all three parameters (period-site, period, and site); (2) when trained on periods, the model’s periodic attributions remained unchanged (compared to 1), while the precision of its period-site and site attributions dropped; and (3) when trained on sites, the model’s accuracy levels were nearly as good as in 1.

On these grounds, it can be proposed that information about artefacts’ sites of origin carries significant weight for effective network learning. Therefore, it should be used alongside the periodic data for classification in future works (see the supplementary material for further details, Table S2). This is also why the network was trained with period-site data even when it was tested only on the period information.

Community detection

A close review of the model’s prediction accuracy presented above suggests that most errors entail the confusion of neighbouring periods (e.g., a Pre-Pottery Neolithic A artefact mistakenly attributed to the Pre-Pottery Neolithic B). The propensity for such errors is readily illustrated by a chronologically sorted confusion matrix (Fig. S3), demonstrating that most errors clustered along the main diagonal (i.e., they occurred between nearby periods). While this observation can be read as indicating an inherent weakness in the model, it also indicates the model’s response to an actual condition: that visually similar artefacts often derive from temporally adjacent contexts. On these grounds, the model’s ability to discern associations among classes (i.e., period-site designations) that can correspond to meaningful archaeological categories was explored. Technically, such clusters are termed ‘communities.’

The archaeological community detection method is illustrated in Fig. 1c. It starts by converting the confusion matrix into a network (i.e., graph) that consists of nodes and edges (i.e., links). In this case, each node represents a class, and each edge represents the confusion between classes that was registered in the confusion matrix. Next, the edges were weighted—they were given numerical values to capture their different strengths. An edge’s weight was computed as follows:

  1. Let \(A \in \mathbb{R}^{C \times C}\) be the normalised confusion matrix, where \(C\) is the number of classes and \(A_{ij}\) is the relative number of cases in which the true label is \(i\) and the predicted label is \(j\). Note that this matrix is not necessarily symmetric, i.e., it may have \(A_{ij} \neq A_{ji}\).

  2. Let \(B = \frac{1}{2}\left(A + A^\prime\right)\) be the symmetrical version of \(A\), where \(A^\prime\) is the transpose of \(A\).

  3. \(B_{ij}\) (equivalently, \(B_{ji}\)) is the weight of the edge that connects nodes \(i\) and \(j\).
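The three steps above can be sketched in a few lines of Python. The sketch assumes that the confusion matrix is normalised row by row (the normalisation scheme is not specified here) and that diagonal entries are kept as self-loops; the function names and the toy counts are hypothetical.

```python
def normalise_rows(counts):
    """Divide each row by its sum so that entries become relative frequencies."""
    normalised = []
    for row in counts:
        total = sum(row)
        normalised.append([v / total if total else 0.0 for v in row])
    return normalised

def symmetrise(a):
    """Return B = (A + A') / 2, where A' is the transpose of A."""
    n = len(a)
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]

def edge_list(b):
    """List (i, j, weight) for every pair i <= j with non-zero weight."""
    n = len(b)
    return [(i, j, b[i][j])
            for i in range(n) for j in range(i, n) if b[i][j] > 0]

# Toy confusion counts: rows are true classes, columns are predicted classes.
counts = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 5],
]
a = normalise_rows(counts)
b = symmetrise(a)
edges = edge_list(b)
```

In this toy example, class 2 is never confused with another class, so its only edge is a self-loop.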

Next, the Louvain community detection algorithm (Blondel et al., 2008) was applied to the network (Fig. 1c, Fig. S6, see methods section for more details), producing clusters—communities—of similar period-site classes. Twenty-eight communities were detected with a modularity score—a measure of the network’s division into communities—of 0.77.

In an attempt to achieve better communities, two further adjustments were introduced. The first consisted of rebuilding the confusion matrix to include the ten nearest neighbour predictions for each query image instead of one. This modification resulted in more confusion and, by extension, a denser network with more edges. The second adjustment was to use only a certain part of the confusion matrix, covering several neighbouring periods, before applying community detection (e.g., Palaeolithic–Epipalaeolithic periods; Bronze–Iron Ages). In this manner, irrelevant confusion is precluded, and a way is paved to explore more nuanced relations among classes. Thus, for instance, Table S3 presents the communities detected for three periodic groups: Palaeolithic–Natufian, Bronze–Iron Ages, and Hellenistic–Byzantine periods.
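Both adjustments can be illustrated with one short Python sketch: every one of a query’s k neighbours contributes a count, and only classes in a chosen subset are kept. The function name, the class names, and the use of k = 2 in the toy data are assumptions made for brevity; the text above uses ten neighbours.

```python
def knn_confusion_counts(true_labels, neighbour_labels, classes):
    """Count confusions using all k neighbours of each query.

    true_labels: true class of each query image.
    neighbour_labels: for each query, the classes of its k nearest neighbours.
    classes: ordered list of classes to keep; all other classes are ignored.
    """
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for true, neighbours in zip(true_labels, neighbour_labels):
        if true not in index:
            continue  # query lies outside the selected period range
        for predicted in neighbours:
            if predicted in index:
                counts[index[true]][index[predicted]] += 1
    return counts

# Toy data with k = 2 neighbours per query and hypothetical classes A, B, C.
true_labels = ["A", "B", "C"]
neighbour_labels = [["A", "B"], ["B", "B"], ["C", "A"]]

# Restrict the matrix to classes A and B only.
counts = knn_confusion_counts(true_labels, neighbour_labels, classes=["A", "B"])
```

The resulting counts can then be normalised and symmetrised in the same way as the single-neighbour confusion matrix.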

Setting out to render these community detection procedures relevant for archaeological practice, an interactive computer application was developed geared to visually present classes and communities against their geographical setting (Fig. 3a). Thus, for instance, Fig. 3b offers an overview of the communities detected, Fig. 3c demonstrates the application’s node selection mode, where the user is presented with all community members associated with a specific node, and Fig. 3d presents a community of nine members (archaeological sites)—eight Roman and one Byzantine—around the Dead Sea.

Fig. 3: Map application for interactive community detection.

A Classes in the database are represented by coloured nodes, where the colour represents period. The menu on the left allows the user to alter the period groups presented and find communities of interest. B Community detection of some period groups based on ten nearest neighbours. The number above each node represents its community. C Node selection mode: displays the community members of a selected class; In this example, they include (in period-site format) Iron I-Megiddo, Late Bronze Age II-Bet Shemesh, Late Bronze Age II-’Ujul, Late Bronze Age II-Megiddo, Late Bronze Age-Megiddo, Late Bronze Age-’Ujul, Middle Bronze Age II–Late Bronze Age-Megiddo, and Pre-Pottery Neolithic B-Jericho. D An example of a community clustered around the Dead Sea; it consists (in period-site format) of Early Roman-Horevot Mazada, Early Roman-Qumran Caves, Early Roman-’En Gedi, Roman-Horevot Mazada, Roman-Wadi Murabba, Roman-Mezad Rahel, Roman-’En Gedi, Roman-Nahal Mishmar Cave of the Treasure, Byzantine-Mesad Boqeq, and Roman-Nahal Hever.

The validity of the resulting communities may be tested against their members’ periodic attributions. If a community comprises one or two successive periods, it may be considered valid. However, if the community includes outliers (i.e., members whose periodic attribution is inconsistent with the rest of the group), this may indicate either a problem or an interesting similarity that needs to be further explored.
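The validity criterion can be written as a simple check. The Python sketch below is one possible reading of “one or two successive periods”: it treats a community as consistent when its members’ periods occupy at most two adjacent positions in a chronological ordering. The ordering shown is a simplified, hypothetical one and does not reproduce the period groups of Table S1.

```python
def is_period_consistent(member_periods, period_order, max_span=2):
    """True if the members' periods cover at most max_span consecutive periods."""
    positions = sorted({period_order.index(p) for p in member_periods})
    return positions[-1] - positions[0] + 1 <= max_span

# Simplified chronological ordering (assumption, for illustration only).
period_order = ["Early Bronze", "Middle Bronze", "Late Bronze", "Iron"]

consistent = is_period_consistent(
    ["Early Bronze", "Early Bronze", "Middle Bronze"], period_order)
with_outlier = is_period_consistent(
    ["Early Bronze", "Middle Bronze", "Iron"], period_order)
```

A community that fails the check is not necessarily wrong; as noted above, it may simply flag members worth a closer look.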

For example, community 1 in Table S3, in the Bronze–Iron Ages group, has the following members: Early Bronze I Megiddo, Early Bronze I Mizpah, Early Bronze I ‘Ai, Early Bronze II-III ‘Ai, Early Bronze III Jericho, Middle Bronze I Megiddo, and Iron II Bet Mirsham. Iron II Bet Mirsham is considered a priori an outlier because its periodic assignment differs from that of the other, Bronze Age classes. Therefore, it would be interesting to look at the confusion between artefacts in this community.

Notably, the number of outliers per community is a function of the range of periods included in the confusion matrix that was used in the community detection. Therefore, the results should be carefully analysed and validated with archaeologists to compensate for insufficiently diverse or imbalanced datasets.

Community detection—Natufian case-study

To explore the potential of the community detection method, a case-study of the Natufian culture is presented here. Since its definition in the 1930s, the Natufian culture of the Levantine late Epipalaeolithic period (ca. 15,000–11,700 years ago) has attracted considerable scholarly attention. There are two main reasons for this. First, the Natufian archaeological record suggests a shift from small nomadic human groups to sedentary hamlets in the Mediterranean zone, a unique event of settling down shortly before the transition to farming in the Neolithic Period (Bar-Yosef, 1998; Bar-Yosef and Valla, 2013). Second, while many Natufian artefact types resemble those of the early Epipalaeolithic and Upper Palaeolithic periods (e.g., pointy implements made of bone), many others are novel, producing unprecedentedly diverse assemblages that include abundant worked-stone and worked-bone items, art items, and personal ornaments. Consequently, Natufian artefacts can be found in museum and web exhibits, such as this database.

A dataset of five rough-period groups was constructed, spanning the Middle Palaeolithic through the Pre-Pottery Neolithic B, thus constituting a temporal range up to two steps removed from Natufian elements (rough-period groups 2–6; Table S1). In this manner, it may be expected that a query of Natufian classes will find close ties with other Natufian classes, weaker ties with classes that are one step removed, and nearly none with classes two steps removed. Five communities were detected (all site designations follow the labels of the IAA picture database): (1) Natufian_Me’arat Kebara, Natufian_Me’arat ha-Nahal, Natufian_Magharat Shuqba, Pottery Neolithic A_Jericho-T. (2) Upper Palaeolithic_Me’arot Hayonim, Natufian_Me’arot Hayonim, Pre-Pottery Neolithic B_Nahal Hemar (3) Middle Palaeolithic_Me’arat Tannur, Middle Palaeolithic_Har Qedumim, Natufian_’Enot’ Eynan, Pre-Pottery Neolithic A_Har Harif (4) Upper Palaeolithic_Me’arat Kebara (5) Natufian_Me’arat Oren.

These communities offer a few interesting insights. First, communities 4 and 5 each contain only one class. This probably means that there was no confusion between this class and others, resulting in self-loops in the network. Second, in community 1, Pottery Neolithic A_Jericho is most likely an outlier, because, unlike the rest of the members, it does not belong to the Natufian period. Third, in community 2, there are two classes from the same archaeological site (Me’arot Hayonim), one dated to the Upper Palaeolithic and the other to the Natufian culture.

A close review of the details demonstrates that many of the affiliations among artefact images, upon which communities are subsequently established, were both visually similar and archaeologically significant. Figure 4 presents some examples of confusion between artefact images. Thus, Community 1 includes similar Natufian bone implements from different sites (Fig. 4a), Community 2 encompasses worked animal teeth from Upper Palaeolithic and Natufian Me’arot Hayonim (Fig. 4c), Community 3 contains flint tools from Middle Palaeolithic Har Qedumim and Me’arat Tannur (Fig. 4e), Community 4 consists of Upper Palaeolithic bone awls from Kebara Cave (Fig. 4g), and Community 5 includes Natufian worked-stone items from Nahal Oren (Fig. 4h).

Fig. 4: Natufian artefact image confusion in community detection case-study.

The query images are presented in the left column, while, to their right, the nearest training-set neighbours are presented in order. If the neighbour is of the same class as the query (i.e., of the same site and period), it is placed in a green frame. Otherwise, a blue frame is used. A blank space indicates that the neighbour image detected by the model was assigned to a different community. A more detailed description of this figure can be found in supplementary material (Additional information for Fig. section). A Community 1, archaeologically meaningful confusion. B Community 1, wrong confusion. C Community 2, archaeologically meaningful confusion. D Community 2, wrong confusion. E Community 3, archaeologically meaningful confusion. F Community 3, confusion. G Community 4, correct predictions. H Community 5, correct predictions.

However, on several occasions, visual similarities among artefacts produced archaeologically false (or problematic) associations. In Community 2, Natufian bone awls were grouped with Pre-Pottery Neolithic B flint arrowheads (ca. 10,000 years ago), which were of similar shape and colour (Fig. 4d). In Community 3, Natufian implements made on ungulate long bones from ‘Eynan (Hula Valley, northern Israel) were grouped with similar Pre-Pottery Neolithic A artefacts from Har Harif (Negev Desert, southern Israel) (Fig. 4f).

The analysis above is only the tip of the iceberg, as it thoroughly examined only some examples of confusion between community members. Researchers are encouraged to follow this procedure with other communities in the dataset (e.g., Table S3), or to apply the community detection workflow to other archaeological databases.

Methods

This section provides additional technical details for particular parts of this work. Each subsection is concerned with a specific methodological or procedural component and does not communicate directly with the others.

Image pre-processing

The images that populated the database were collected without an image capturing protocol. Consequently, image capturing conditions varied considerably from one case to the next, mainly in terms of background and scale. To overcome this, a homogeneous white background was applied and the scale was removed following one of two procedures: (1) automatic contour retrieval (Suzuki, 1985) performed on the output of the Canny edge detector (Canny, 1986) applied to the input image, or (2) the interactive GrabCut method (Rother et al., 2004). The second procedure is comparatively manual and was used whenever the first procedure failed. To fit the model’s input spatial dimensions, the images were resized to 300 × 300 pixels.

Base network

To choose the base network, three ImageNet pre-trained models were evaluated. These include VGG (Simonyan and Zisserman, 2014), InceptionResNetV2 (Szegedy et al., 2017), and EfficientNetB3 (Tan and Le, 2019). We found that EfficientNetB3 was 1% more precise than the other two and needed fewer epochs for training.

Loss functions

Large Margin Cosine Loss (Wang et al., 2018) and cross-entropy loss functions resulted in similar classification accuracy measures, while further training with online triplet mining and triplet loss (Schroff et al., 2015) improved results by around 1% on VGG and InceptionResNetV2. The final model was trained with the cross-entropy loss alone on EfficientNetB3.

Distance metric

Predictions for the query images were generated by determining their k-nearest training-set neighbours (k = 1 in this setup). For this purpose, the cosine similarity distance measure was used:

$$d\left( {{{{\boldsymbol{x}}}},{{{\boldsymbol{y}}}}} \right) = \cos \left( {{{{\boldsymbol{x}}}},{{{\boldsymbol{y}}}}} \right) = \frac{{{{{\boldsymbol{x}}}} \cdot {{{\boldsymbol{y}}}}}}{{\left\| {{{\boldsymbol{x}}}} \right\| \cdot \left\| {{{\boldsymbol{y}}}} \right\|}},$$

where \({\boldsymbol{x}}, {\boldsymbol{y}}\, \in{\mathbb{R}}^{D}\) are the feature vectors of two different input images, and D is the embedding vector length.
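As an illustration of the measure above, the following Python sketch computes the cosine similarity of two vectors and uses it for nearest-neighbour prediction. It assumes that the nearest neighbour is the training vector with the highest cosine value, and that no feature vector is all zeros (which would cause a division by zero). The function names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def nearest_neighbours(query, train_features, train_labels, k=1):
    """Return the labels of the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_features)),
                    key=lambda i: cosine_similarity(query, train_features[i]),
                    reverse=True)
    return [train_labels[i] for i in ranked[:k]]

# Toy two-dimensional "embeddings" with hypothetical class labels.
train_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_labels = ["a", "b", "c"]
query = [2.0, 0.1]

prediction = nearest_neighbours(query, train_features, train_labels, k=1)
top_two = nearest_neighbours(query, train_features, train_labels, k=2)
```

Because the measure depends only on direction, the query’s larger magnitude has no effect on the ranking.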

Voting of five CNNs

To optimise these results, five models with the same ImageNet initialisation were trained, each generating a different feature vector; these were then combined into the final feature vector \(Z^{RP}\). To do so, (1) the five feature vectors were concatenated, yielding \({Z}\,\in\,{\mathbb{R}}^{5D}\), where D is the length of a single model’s feature vector, and (2) Z was randomly projected to a lower-dimensional space (due to memory limitations) by multiplying it with a random Gaussian matrix:

$$Z^{RP}_{D\times1} = G_{D \times 5D}Z_{5D \times 1}$$

where \(Z^{RP}\) is Z projected onto a lower, D-dimensional subspace, and \(G_{D \times 5D}\) is a random Gaussian matrix.
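The concatenation and projection can be sketched as follows in plain Python. The sketch assumes standard normal entries for the Gaussian matrix and a toy embedding length of D = 4; the distribution’s scale and any normalisation used in the actual experiments are not specified here. The same matrix must be reused for every image so that the projected vectors remain comparable.

```python
import random

def concatenate(vectors):
    """Join several feature vectors into one long vector Z."""
    return [value for vector in vectors for value in vector]

def gaussian_matrix(rows, cols, seed=0):
    """Random matrix with independent standard normal entries (fixed seed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(g, z):
    """Matrix-vector product G Z."""
    return [sum(g_ij * z_j for g_ij, z_j in zip(row, z)) for row in g]

D = 4  # toy embedding length (1536 in the setting described above)
# Five hypothetical feature vectors, one per trained model.
features = [[float(m + d) for d in range(D)] for m in range(5)]

z = concatenate(features)           # length 5D
g = gaussian_matrix(D, 5 * D)       # shape D x 5D
z_rp = project(g, z)                # length D
```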

Training details

CNN weights were optimised by the AdamW optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 0.0001, which was divided by 10 whenever the validation loss stopped improving. The hardware used throughout these experiments was a single Nvidia GeForce 2080 Ti GPU, and the batch size was 20.

Community detection

A modularity score measures the quality of a network’s partition into communities (Blondel et al., 2008; Newman and Girvan, 2004). A high score indicates dense connections within communities and sparse connections between them. It is defined as the fraction of edges within communities minus the expected fraction had their distribution been random.

Derivation of the modularity formula starts with two nodes, v and w. The difference between the actual and expected weight between nodes v and w is calculated as follows:

$$A_{vw} - \frac{{k_vk_w}}{{2m}}$$

where (1) \(A_{vw}\) is the weight of the edge between v and w, (2) \(k_i = \sum_j A_{ij}\) is the (weighted) degree of node i, and (3) \(m = \frac{1}{2}\mathop {\sum}\nolimits_{ij} {A_{ij}}\) is the sum of all weights in the graph (the number of edges in a uniform-weights graph).

Summation over all pairs that belong to the same community will yield the modularity score Q:

$$Q = \frac{1}{{2m}}\mathop {\sum}\limits_{vw} {\left[ {A_{vw} - \frac{{k_vk_w}}{{2m}}} \right]\delta \left( {c_v,c_w} \right)}$$

where \(c_i\) is the community of node i, and \(\delta(c_v, c_w)\) equals one if nodes v and w belong to the same community and zero otherwise.
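The formula for Q can be checked on a small example. The Python sketch below evaluates it directly for a symmetric weighted adjacency matrix and a community label per node; it is a didactic implementation with a hypothetical function name, not an efficient one.

```python
def modularity(a, communities):
    """Modularity Q of a partition.

    a: symmetric weighted adjacency matrix (list of lists).
    communities: community label of each node, in node order.
    """
    n = len(a)
    k = [sum(row) for row in a]   # weighted degree k_i of each node
    two_m = sum(k)                # sum of all entries of a, i.e. 2m
    q = 0.0
    for v in range(n):
        for w in range(n):
            if communities[v] == communities[w]:  # delta(c_v, c_w) = 1
                q += a[v][w] - k[v] * k[w] / two_m
    return q / two_m

# Toy graph: two disconnected edges, 0-1 and 2-3, each of weight 1.
a = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(a, [0, 0, 1, 1])   # one community per edge
q_merged = modularity(a, [0, 0, 0, 0])  # everything in one community
```

Splitting the graph along its two components gives Q = 0.5, whereas placing all nodes in one community gives Q = 0, which matches the intuition that a good partition has dense connections within communities and sparse ones between them.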

Based on this metric, Blondel et al. (2008) introduced a popular community detection algorithm (Fig. S6). It is based on the iteration of two phases. In the first phase, each node is initially assigned to its own community. Then, for each node i, the modularity gain that would result from moving it into the community of each neighbour j is calculated. After considering all possible neighbours, node i is placed in the community that produced the highest modularity gain. This process is repeated until no further improvement in modularity score is noted.

The second phase entails establishing a new network based on the communities detected in the first phase. Each community is represented by a node, and edge weights are determined by summing all the edges between communities, while edges within communities produce self-loops.
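The aggregation performed in this second phase can be sketched in Python as shown below. One assumption should be noted: the sketch sums every entry of the adjacency matrix, so an internal edge contributes to the self-loop from both directions and the total weight of the network is preserved. This convention is common in implementations but is not spelled out in the description above; the function name is hypothetical.

```python
def aggregate(a, communities):
    """Collapse each community into a single node.

    a: symmetric weighted adjacency matrix (list of lists).
    communities: community index (0 .. n_comm - 1) of each node.
    Returns the adjacency matrix of the aggregated network.
    """
    n_comm = max(communities) + 1
    agg = [[0.0] * n_comm for _ in range(n_comm)]
    for v, row in enumerate(a):
        for w, weight in enumerate(row):
            # Entries inside a community accumulate on the diagonal (self-loop);
            # entries between communities accumulate off the diagonal.
            agg[communities[v]][communities[w]] += weight
    return agg

# Toy graph: edges 0-1 and 2-3 of weight 1, plus a weaker edge 1-2 of weight 0.5.
a = [
    [0, 1, 0, 0],
    [1, 0, 0.5, 0],
    [0, 0.5, 0, 1],
    [0, 0, 1, 0],
]
agg = aggregate(a, [0, 0, 1, 1])
```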

The combination of these two phases is called a “pass,” and it is repeated until the modularity score stabilises and maximum modularity is achieved (Fig. S6).

Prior confusion

Ambiguities concerning periodic attribution (e.g., Roman/Early Roman) may be considered a type of label noise. However, in practice, they are attributable to several closely related features of the archaeological record: (1) Most artefact types span several periods, (2) archaeological periods usually have vague boundaries, and (3) artefacts may vary in frequency across time and space while retaining their formal properties.

Motivated by Kaneko et al. (2019), attempts were made to enhance the loss function with prior confusion knowledge. Let \((x_j, y_j)\) be an image-label pair; given \(x_j\), the probability for label \(y_j\) will be

$$p\left( {y_j\left| {x_j} \right.} \right) = \mathop {\sum}\limits_i {p\left( {y_j\left| {y_i} \right.} \right)p\left( {y_i\left| {x_j} \right.} \right),}$$

where \(p(y_i|x_j)\) is the ith output of the neural network’s final layer when the input image is \(x_j\), and \(p(y_j|y_i)\) is the measure of ambiguity between labels \(y_j\) and \(y_i\). For example, if there is 50% indeterminacy between the Persian-Hellenistic and Hellenistic labels, it would be 0.5. The final cross-entropy loss for a mini-batch with B images and C classes is

$${\mathrm{Loss}} = - \mathop {\sum}\limits_j^B {\mathop {\sum}\limits_i^C {t_{ij}\log p\left( {y_j\left| {x_j} \right.} \right),} }$$

where \(t_{ij}\) is the ith one-hot encoding element of the label \(y_j\).
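A minimal Python sketch of this idea for a single image is given below. It reflects one reading of the formulas above: the network’s output probabilities are mixed through an ambiguity matrix whose entry in row j and column i holds \(p(y_j|y_i)\), and the usual cross-entropy is then taken over the mixed probabilities. The matrix values are invented for illustration, and the function names are hypothetical.

```python
import math

def corrected_probabilities(p, ambiguity):
    """q[j] = sum_i ambiguity[j][i] * p[i], the label probability after mixing."""
    return [sum(a_ji * p_i for a_ji, p_i in zip(row, p)) for row in ambiguity]

def cross_entropy(q, one_hot):
    """Cross-entropy between a one-hot target and the probabilities q."""
    return -sum(t * math.log(q_j) for t, q_j in zip(one_hot, q) if t > 0)

# Network output for one image over two labels (toy values).
p = [0.2, 0.8]

# Assumed ambiguity matrix: each column sums to one. Label 0 is unambiguous,
# while label 1 has 50% indeterminacy with label 0.
ambiguity = [
    [1.0, 0.5],
    [0.0, 0.5],
]

q = corrected_probabilities(p, ambiguity)
loss = cross_entropy(q, one_hot=[1, 0])
```

For a mini-batch, the per-image losses would simply be summed.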

Notwithstanding the method’s potential, quantifying the prior ambiguity measure \(p(y_j|y_i)\) proved difficult, rendering the method unusable for this purpose.

Attempts were made to manage periodic indeterminacies by setting \(p(y_j|y_i)\) according to a Gaussian function. Unfortunately, this method did not improve the model’s accuracy measures.

Website for archaeological predictions

A website containing the pre-trained CNN model is available (Footnote 1). Researchers are invited to upload their query images and receive images of similarly labelled artefacts from the training set.

Conclusion

Machine learning is a powerful tool to explore large datasets. This paper describes the development of a deep-learning-based model for a diverse archaeological dataset that spans more than a million years of south Levantine material culture. It is particularly well-suited for purposes of artefact classification, potentially accelerating the interpretation of archaeological contexts. Moreover, based on the model, meaningful connections across artefacts, assemblages, and sites were automatically found.

Notably, archaeological classification is uniquely challenging. It is often ambiguous, and there is considerable room for controversy over dating. Moreover, archaeological assemblages are synchronically variegated, encompassing materially and visually distinct objects, but often diachronically similar. Harnessing this inherent quality of temporal ambiguity is key to finding meaningful archaeological communities, recognising that the confusion of classes can underscore real connections.

At its most basic, this CNN can help archaeologists find similar artefacts and efficiently complete some of the more tedious and humdrum tasks of the profession. In its more advanced applications, the model can help archaeologists analyse large bodies of data, find previously unknown relations, and raise new archaeological questions. The workflow presented here can be applied to other datasets worldwide and has the potential to make way for significant archaeological insights.