1 Introduction

Digitising the past is a way to preserve history, restore deteriorating or incomplete text, extract facts and information, and support searching, document retrieval and data mining tasks. The digitisation of books and documents is among the objectives that current digital libraries and electronic government initiatives place at the top of their priorities. For example, dozens of universities, research centres and companies in Europe have together started a large-scale EU consortium called IMPACT (Improving Access to Text) [1, 2]. Among the “Endangered Archives Programme” initiatives of the British Library is the digitisation of manuscripts of the Al-Aqsa Mosque Library, East Jerusalem [3]. This historical collection contains more than a hundred Arabic language titles that span several Islamic periods, from the ninth century CE to the end of the Ottoman rule in Palestine at the beginning of the twentieth century. These books cover topics including the Arabic language and literature, logic, mathematics, religion, and Sufism.

An old and still valid way to transcribe historical handwritten documents is to rely on crowdsourcing, the practice of gathering information or input into a task by acquiring the services of a large number of people (the crowd). It is often characterised by small and short-term deals [4]. A recent study shows that crowdsourcing, when combined with contemporary technology, delivers far more complete and validated data than automated processes alone could produce [5]. This indicates that the automated conversion of a historical document into readable text still poses various challenges. The work herein offers possible assistance in lifting some of these challenges by providing one of the most extensive free-access semi-annotated historical handwritten document datasets.

We conclude this section by noting that this dataset would enrich the availability of historical handwritten document datasets and help develop more accurate algorithms for word spotting, optical character recognition (OCR), document layout analysis and image binarization. It would also serve the research community interested in history and heritage (e.g., genealogists), see Fig. 1. This set of motives was the ultimate impetus for the creation of the SHIBR dataset (the Swedish Historical Birth Records). This complete-page dataset (SHIBR) complements the previously published numerical handwritten dataset (ARDIS) [6]; both are generously provided for free by Arkiv Digital AD AB, a Swedish company.

Fig. 1 SHIBR and what it serves: per discipline (a) and per community (b)

2 Review of related public datasets

Different public handwritten document image datasets have been created and presented to resolve various document image challenges such as text line segmentation [7], word spotting [8], writer identification [9], digit and character segmentation and recognition [10,11,12], binarization [13], and a variety of other challenges [14,15,16]. These datasets enable researchers to develop automated and computationally efficient algorithms. Generally, the existing datasets are classified into two groups based on the era in which they were written: historical or modern datasets. The well-known and widely used document databases are listed in Table 1, several of which are described in this section.

Table 1 Annotated handwritten image datasets in different languages that are publicly available

George Washington database (GW) [17, 18]: This database is a baseline database for text line segmentation, word spotting and word recognition tasks. It consists of 20 historical handwritten document images written in English in longhand script with an ink pen in the eighteenth century. These images are annotated with 656 text lines, 4894 word instances, 1471 word classes and 82 letters.

IAM database [22, 23]: The IAM database contains 1539 modern handwritten document images written in English by 657 different writers. The documents were scanned at a resolution of 300 dpi and stored as greyscale images. The document images were labelled using an automatic segmentation approach and verified visually. The database consists of 5685 isolated and labelled sentences, 13,353 isolated and labelled text lines, and 115,320 isolated and labelled words.

VML-HD database [24]: This database includes 668 handwritten document images. The documents in the database were written in Arabic by different writers between the years 1088 and 1451. All the words and characters in the document images are manually annotated with bounding boxes. As a result, this database consists of 159,149 annotated words and 326,289 annotated characters.

HADARA80P database [25]: This database contains 80 handwritten document images written in the Arabic language. This database is used for word segmentation problems, as the words in the document images are annotated with polygons.

Esposalles database [27, 28]: The Esposalles database is a Spanish historical handwriting document image database consisting of 173 document images. The documents were written between 1451 and 1905, and they contain information from the marriage licenses of Spanish citizens. These documents are collected from different books available at the archives of the Cathedral of Barcelona. All text blocks, lines, and transcriptions in the document images are manually labelled. Furthermore, this database has been used to develop handwriting recognition algorithms.

Other handwritten document image databases have also been created and can be found with more details in [19, 21, 26, 29]. Furthermore, the proposed SHIBR document dataset is comprehensively scrutinised and described in Sect. 4.

2.1 Limitations of existing document images databases

Even though some of the existing datasets have annotations, most of them have several limitations: (1) a small number of document images; (2) lack of datasets with Swedish characters; (3) lack of historical documents written in Swedish handwriting styles with various types of dip pens; and (4) lack of datasets with significant variations of artefacts (e.g., degradation, bleed-through, ink leakage etc.). For instance, the George Washington dataset contains the fewest document images, with 20 English document images. In contrast, the IFN/ENIT dataset, the most extensive document image database (see Table 1), consists of 6735 binary Arabic document images. The main challenges when using these datasets for historical document image analysis are therefore (1) the small number of document images, which leads to small intra- and inter-class variations, (2) the small number of artefacts they exhibit, and (3) their insufficient coverage of historical handwriting styles. To support the development of research in historical document image analysis, it is essential to construct a new dataset that addresses these shortcomings. This paper therefore proposes a new and large dataset (SHIBR) containing 15,000 document images. The SHIBR dataset is semi-annotated, easing the development of automated and semi-automated machine learning methods for document analysis applications. To the best of our knowledge, the SHIBR dataset is the largest historical handwritten document dataset and the first semi-annotated historical document image dataset with Swedish characters.

3 Challenges and opportunities in historical handwritten documents

In the context of handwritten document image analysis, many challenges need to be tackled [30]. Most of the state-of-the-art solutions focus mainly on word spotting and recognition challenges.

3.1 Word/pattern spotting

Over the past decade, an enormous collection of handwritten or machine-printed documents have been digitised to preserve the contained information. Word spotting methods are used to extract relevant information from these documents. Generally, word spotting in handwritten document images is much more complex than on machine-printed document images—the former consists of significant variations of handwriting styles and various character types in different languages. As the text in these historical documents was written by different writers, it generates large variations in appearances because of, on the one hand, skewness, curvature, aspect ratio, and size, and because of broken and connected words/characters on the other hand. As a result, these variations in writing styles in different languages may create endless diversities for word spotting in handwritten document images. Many word spotting methods have been proposed for document indexation. They can be classified into two groups: 1) segmentation-based and 2) segmentation-free methods.

Matching is a segmentation-free word spotting approach in which a template word image is searched for in document images. It is one of the basic approaches that have been applied for word spotting on document images, and it relies on a similarity or distance measure between the template word image cropped from a document image and a target region of the document images. In [31], a word-level matching scheme is proposed that searches for a template word image in printed document images using a feature-extraction technique; the extracted features are then used for similarity estimation. In [32], another word spotting framework is proposed, based on the dynamic time warping (DTW) matching technique. The DTW matching algorithm is applied to machine-printed document images, providing superior results for word spotting. In [33], a block adjacency graph (BAG) method for word spotting is designed and employed based on similarity estimation between the template image and the moving window regions in document images. A word shape coding scheme is proposed by Bai et al. [34] that combines feature descriptors and a matching technique for word spotting in document images. A block-based document image descriptor used for word spotting in historical printed documents based on the template matching process is proposed by Rabaev et al. [35]. Their experiments show that this method provides promising results if the documents do not include too many undesired artefacts. Other matching-based word spotting methods can be found in [36,37,38]. The matching-based word spotting techniques have several drawbacks: 1) they are time-consuming, 2) they cannot overcome undesired artefacts present in handwritten documents, and 3) they often yield poor accuracy rates on handwritten text images.
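To make the matching idea concrete, the following is a minimal sketch of DTW-based template matching, not the implementation of any of the cited works. It assumes binarised word images given as lists of pixel rows, and it uses a deliberately simple, hypothetical feature (the ink-pixel count per column); the function names `column_profile`, `dtw_distance` and `rank_candidates` are illustrative only.

```python
from math import inf


def column_profile(image):
    """Hypothetical feature: number of ink pixels (1s) in each image column."""
    return [sum(column) for column in zip(*image)]


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def rank_candidates(template, candidates):
    """Return candidate indices ordered from best to worst DTW match."""
    reference = column_profile(template)
    scored = [
        (dtw_distance(reference, column_profile(candidate)), index)
        for index, candidate in enumerate(candidates)
    ]
    return [index for _, index in sorted(scored)]
```

Because DTW allows columns to be stretched, a candidate that is a horizontally widened copy of the template can still reach a distance of zero, which is the property that makes it attractive for handwriting of varying width. The quadratic cost per comparison also illustrates why such matching is time-consuming on large collections.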

Thus, learning-based segmentation-free word spotting techniques are designed and applied to increase word spotting accuracy. For instance, Hidden Markov Models (HMMs) have been used for word spotting in handwritten documents [39, 40]. In addition, hybrid models have been developed that combine HMMs with a support vector machine (SVM) [41], a neural network (NN) [42] or a deep convolutional neural network (CNN) [43]. In another work, a new word spotting system for handwritten Urdu language document images is proposed [44]. The method uses several pre-processing steps such as binarization, connected component analysis and edge detection. Subsequently, a sliding window based on an SVM classifier is used to spot Urdu words. In [45], a word spotting and recognition approach based on a common representation of word images and text strings is proposed. The method first extracts the standard features to decrease the dimensional space, and then the nearest neighbour algorithm is used for word spotting. Frinken et al. [46] have designed a novel method based on a recurrent neural network (RNN) for word spotting in handwritten documents. In [47], an efficient patch-based framework combined with scale-invariant feature transform (SIFT) descriptors is proposed for keyword spotting in historical document collections. In [48], a CNN architecture is designed for word spotting in handwritten documents. Extensive surveys of word spotting methods can be found in [49,50,51,52,53].

SHIBR lends itself well as a challenging benchmark for word spotting methods. Since the SHIBR dataset, like many other historical documents, does not have segmented words, we exclusively look at the segmentation-free methods. These methods may, for example, rely on traditional computer vision to find interesting regions or on region-proposal networks like Faster-RCNN [54]. The Ctrl-F-Mini algorithm fulfils this criterion and has been shown to outperform many existing methods [55]. We, therefore, select Ctrl-F-Mini for the first benchmarks on SHIBR. While its dependence on bounding boxes during training makes it difficult to train on SHIBR, its pre-trained models may be readily evaluated on the dataset. Ctrl-F-Mini is a deep convolutional network model for segmentation-free word spotting. It is related to the Faster-RCNN alternative but uses Dilated Text Proposals (DTP) [55] instead of a Region Proposal Network (RPN) [54] to propose regions of a manuscript page potentially containing words. For each region, the network outputs the estimated probability of it depicting a word and a word embedding. The word embeddings are either the Pyramidal Histogram of Characters (PHOC) or the Discrete Cosine Transform of Words (DCToW) [55] embeddings. In word retrieval, the regions are ranked by their embeddings’ cosine similarity to the query string embedding.
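The retrieval step described above can be illustrated with a simplified sketch. It is not the Ctrl-F-Mini implementation: in the real model the region embeddings are predicted by the network from image content, whereas here they are simulated by embedding known strings. The alphabet (including å, ä, ö), the pyramid levels and the function names are assumptions made for illustration, and the PHOC variant below is a reduced version that only encodes single characters.

```python
from math import sqrt

ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"  # assumption: lower-case Swedish letters
LEVELS = (1, 2, 3)                          # assumption: a small pyramid


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Simplified PHOC: one binary character histogram per pyramid region."""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # characters outside the alphabet are not encoded
                # Integer arithmetic in units of 1 / (n * level):
                # character i spans [i*level, (i+1)*level],
                # the region spans [region*n, (region+1)*n].
                overlap = min((i + 1) * level, (region + 1) * n) - max(i * level, region * n)
                # Assign the character if at least half of it lies in the region.
                if 2 * overlap >= level:
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_regions(query, region_embeddings):
    """Return region indices ordered by decreasing similarity to the query string."""
    q = phoc(query)
    scores = [cosine_similarity(q, e) for e in region_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

For example, `rank_regions("olof", [phoc("anna"), phoc("olof"), phoc("örebro")])` places the region embedded from "olof" first. Note that a character missing from the embedding alphabet contributes nothing to the vector, so the choice of alphabet directly limits which query strings can be represented.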

3.2 Opportunities: genealogy research

Genealogy and the study of family history/tree are both interlinked. In the modern era, and with the success and spread of DNA sequencing in high-throughput genomic sequences, genealogy has become a vivid field. However, genealogists are also interested in unravelling the history of families by mining historical documents. Nevertheless, there exist contentious points that Hatton [56] has rethought in a study examining history, lineage, identity, and technology in relation to genealogy. As depicted in Fig. 2, the top twenty countries most interested in genealogy show diversity, with Sweden coming slightly above the USA. The statistics are retrieved from the Google Trends tool (a website by Google that analyses the popularity of top search queries in Google Search across various regions and languages).

Fig. 2 Top twenty countries most interested in genealogy as recorded by the Google Trends tool between 2004–2020

3.3 Opportunities: window into the past

A window into the past may allow the investigation of assumptions about the historical position and actual practices and thinking of a particular epoch, presenting genealogists with opportunities to further their theories. For instance, Abildgren, K. explores what a crowdsourced genealogical online database can say about Denmark’s income inequality during the First World War [57]. Additionally, Zhu Z. constructs the concept system and relationships of the genealogy ontology and takes Wu Shilai, the ancestor of the 23rd generation of Wu’s, as an example to realize the visualisation of traditional Chinese genealogy [58].

4 SHIBR dataset

This dataset is retrieved from the Arkiv Digital AD AB image and index database. When a child was born in Sweden in the 1800s, he or she was registered by a priest in a church record book called Birth and Christening Records. These priests registered the child’s name, when the child was born and baptised, where the child lived, and information about the father and the child’s mother, as shown in Table 2. The transcription is based on manual annotation (at Arkiv Digital AD AB and its partners) of scanned images from 1800 to 1840.

Table 2 Columns in SHIBRp’s accompanying transcribed file

4.1 Structure of SHIBR

The master dataset (SHIBRm) consists of 818,110 indexed rows and 64,084 images. This dataset is confidential and can only be used according to the agreement between the company Arkiv Digital AD AB and the Blekinge Tekniska Högskola (BTH). However, a subset comprising random samples from the period 1800–1840 was determined to abide by the GDPR (the European General Data Protection Regulation) law and, therefore, can be made open access. As such, the public dataset (SHIBRp) with semi-annotation consists of 10,500, 2250, 2250 images for training, testing and validation, respectively. Hence, in total, the SHIBRp public dataset consists of 15,000 high-resolution (2000 × 1300 to 6000 × 4000) images in RGB (Red, Green and Blue) colour space (~ 50 GB of data) that exhibit a variety of layouts, handwriting styles, background colour and degradations. Additionally, SHIBRp is associated with Excel spreadsheets for each of the three folders (training, testing and validation). In total, the spreadsheets contain 191,301 entries.

Swedish counties (län) covered: The counties recorded in these books are Gotland, Gävleborg, Norrbotten, Västerbotten, Västernorrland, Västmanland, Älvsborg, and Örebro (see Fig. 3).

Fig. 3 Map of Sweden, as before the year 1997 [Source: Familysearch.org], showing some of the Swedish counties reported in the SHIBR dataset

Description of the index columns: Each image (of the 15,000) corresponds to a double page of a book, and each of these images is associated with an entry in the annotation file (manual transcription) with 17 columns, as shown in Table 2. The structure of the transcribed file, along with a sample image, is exemplified in Fig. 4.

Fig. 4 Example of a scanned book page and the view of a file containing transcribed data (17 columns). See Table 2 above for descriptions of columns 1–17

4.2 Mining SHIBRm – Statistical insights

Data mining methods aim at discovering frequently occurring patterns in a source dataset [59]. Here, we use basic statistics to identify potentially helpful information associated with the document images in SHIBRm, which pertain to the nineteenth century. The findings listed here are merely examples of the uncharted side of the SHIBRm dataset and of what its public version, SHIBRp, can offer to the research community, especially to genealogists. Please note that the statistics drawn herein only reflect the data we currently have (i.e., SHIBRm). In no way should they be taken as a de facto reference to the total population’s overall reality during that era.

  • Birth rate stratified by county/year: By examining the birth rate, we can see that it has an overall increasing trend that is seemingly aligned with the public statistics. When the retrieved data is stratified by county, we see, at the macro-level, that Älvsborgs län exhibits a large birth rate, as shown in Fig. 5. This could, for example, be linked to socioeconomic status.

  • Rate of stillborn (dödfödd): Analogous to the birth rate, and probably driven by it (dependent variables), is the death rate, which also exhibits an increasing trend at the macro-level. Table 3 shows that Gotlands län tops the list with 1.857% of total newborns, and at the bottom of the list is Norrbottens län with 0.749%. Worth noting is that baby boys consistently exhibit higher death rates than baby girls in the data we have spanning the period 1800 to 1840. We are uncertain if this difference is genuine and descriptive of the total population in that period. However, this finding is consistent with a recent report published by Statistics Sweden, SCB (a Swedish agency), stating that “infant mortality has been higher for boys than for girls but this difference between the sexes is almost non-existent in the twenty-first century” and with the Historical Statistics of Sweden [60]. Figure 6 shows the overall mortality rate stratified by gender, and Table 3 tabulates the rate for each county.

  • Period until baptised: Baptism is a Christian rite for acceptance and adoption into Christianity. It was socially unacceptable not to baptise a child during that era. SHIBRm stores the number of days from birth to baptism. Figure 7 shows that Älvsborgs län is the county with the shortest time interval between birth and baptism. Norrbottens län and Västerbottens län top the list, probably because they are located in northern Sweden, where a large part of the population is made up of the Sámi people (Laplanders in English), who may have had to travel long distances to churches for baptism.

  • Most common first names (babies, women, men): Another typical trend to look at is the popularity of the first names given to babies and parents. Some of these names are fading away in modern Sweden, allowing for more trendy ones, especially among the young generation. Table 4 shows the top 10 most common names among newborns, fathers and mothers.

  • Age of women in the birth records: What is the average age of mothers in the birth records of SHIBRm during the period 1800–1840? Fig. 8 depicts a bar chart of the age distribution of women in all counties. More in-depth statistics stratified by counties are tabulated in Table 5.

  • Most typical job titles (women/men): A final factor we look at in SHIBRm data is the job title of men and women during that period. As can be seen from Table 6, most of the men were farmers, military officers, government employees (tax officers), etc. Women during that period were not empowered to have their own jobs, and thus we see a lack of job titles in their column. Maids and housekeepers were the only reported jobs we found in the large dataset SHIBRm.
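Stratified rates such as the stillborn rate per county listed above can be computed with a few lines of code. The sketch below is only an illustration of the calculation: the record layout and the field names `county` and `stillborn` are hypothetical and do not correspond to the actual column names of the transcription files.

```python
from collections import defaultdict


def stillborn_rate_by_county(records):
    """Return {county: stillborn percentage} for an iterable of birth records.

    Each record is assumed to be a dict with the hypothetical keys
    'county' (str) and 'stillborn' (bool).
    """
    births = defaultdict(int)
    stillborn = defaultdict(int)
    for record in records:
        births[record["county"]] += 1
        if record["stillborn"]:
            stillborn[record["county"]] += 1
    # Percentage of stillborn entries among all registered births per county.
    return {county: 100.0 * stillborn[county] / births[county] for county in births}
```

The same grouping pattern applies to the other statistics in this section, for instance by replacing the stillborn flag with the number of days between birth and baptism and averaging per county.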

Fig. 5 Birth rate aggregated from the eight counties (right) and stratified by county (left)

Table 3 Stillborn rate stratified by counties
Fig. 6 Stillborn rate in all counties

Fig. 7 Number of days between birth and baptism, stratified by county

Table 4 Ten most common names in Sweden in the period between 1800 and 1840
Fig. 8 Overall age distribution of women in the birth records (1800–1840)

Table 5 Age statistics of women in the birth records stratified by counties (1800–1840)
Table 6 The most common job titles for men/women during the period (1800–1840)

In the period covered by the dataset, the Swedish written language was not standardised as it is today, and most people could not read and write. Compulsory primary schooling for every child in Sweden was only introduced by law in 1842. Before that, only certain groups of people, for example priests, could read and write. As a result, the spelling of words and names could differ across Sweden and between individuals, which leads to heterogeneity in both spelling and handwriting. This heterogeneity makes robust image processing and recognition of historical texts more complex. However, after the first-round manual transcription of the SHIBR dataset by the partners of the company (Arkiv Digital AD AB), the company had Swedish native speakers carry out a validity check to improve the transcription quality.

5 Experiments and results

In this section, we present experimental results for the chosen algorithm, Ctrl-F-Mini [55] (discussed in Sect. 3.1), on page retrieval in SHIBR. In all cases, we use the trained models and implementation made available by the original authors. These models were trained with the Adam optimiser using a learning rate of 0.001, multiplied by 0.1 every 10,000 steps. After a total of 25,000 steps, the model with the highest hold-out-set score was used. While Ctrl-F-Mini comes with models trained with different loss functions, we select the ones trained with the cosine loss as they have the best benchmark results. We evaluate both the PHOC and the DCToW embedding variants. Finally, since one model is provided per fold, we always select the one corresponding to the first fold.
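The step-wise learning-rate decay mentioned above can be written out explicitly. This is a small sketch of the schedule as described in the text, assuming that steps are counted from zero and that the decay is applied at exact multiples of 10,000; the function name and parameter names are illustrative.

```python
def learning_rate(step, base_lr=1e-3, decay=0.1, decay_every=10_000):
    """Learning rate at a given training step under step-wise decay."""
    # The rate is multiplied by `decay` once for every completed block of steps.
    return base_lr * decay ** (step // decay_every)
```

Under these assumptions, the 25,000 training steps use three rates: 0.001 for steps 0 to 9,999, 0.0001 for steps 10,000 to 19,999, and 0.00001 for the remainder.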

The SHIBRp scanned pages are converted to greyscale colour space and pre-processed using the model’s provided settings for the George Washington dataset. We use an NMS (Non-Maximum Suppression) threshold of 0.4 for the DTP regions for all experiments as this corresponds to the negative-match threshold in Ctrl-F-Mini [55]. We do not use any additional word-likeness filtering, and we do not limit the number of regions per document before predicting the embeddings.
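For readers unfamiliar with non-maximum suppression, the following is a generic greedy NMS sketch with the threshold of 0.4 used here. It is not the code shipped with Ctrl-F-Mini; the box format `(x1, y1, x2, y2)` and the per-region scores used for ordering are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def non_maximum_suppression(boxes, scores, threshold=0.4):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box above the threshold.
        if all(iou(boxes[i], boxes[k]) <= threshold for k in kept):
            kept.append(i)
    return kept
```

A lower threshold removes more near-duplicate regions per page, which reduces the number of embeddings to compute at the cost of possibly discarding tightly packed words.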

5.1 Segmentation-free evaluation of word spotting

The most used metric for evaluating word spotting is the Mean Average Precision (mAP) [61]. For a given query q, the precision at k is:

$$P_{k}(q) = \frac{\left| \{\text{top } k \text{ candidates for } q\} \cap \{\text{instances of } q\} \right|}{k} = \frac{1}{k}\sum_{j=1}^{k} r_{j}(q)$$
(1)

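where the relevance indicator \(r_{j}(q)\) is 1 if the candidate with rank j matches q and 0 otherwise. Summing the precision over the ranks k of the n ranked candidates at which a relevant candidate is retrieved, and dividing by the number of instances of q, gives the Average Precision (AP):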

$$\text{AP}(q) = \frac{\sum_{k=1}^{n} P_{k}(q)\, r_{k}(q)}{\left| \{\text{instances of } q\} \right|}$$
(2)

To finally retrieve the mAP, the AP is averaged over the set of all queries \(Q\) for the task:

$$\text{mAP}(Q) = \frac{1}{|Q|}\sum_{q \in Q} \text{AP}(q)$$
(3)
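Equations (1) to (3) translate directly into code. The sketch below assumes that each query is represented by its ranked list of relevance indicators (1 for a match, 0 otherwise) and that every instance of the query appears somewhere in the ranked list unless a separate instance count is passed; the function names are illustrative.

```python
def average_precision(relevance, n_instances=None):
    """AP for one query, given relevance indicators r_k(q) in ranked order."""
    if n_instances is None:
        # Assumption: all instances of the query occur in the ranked list.
        n_instances = sum(relevance)
    if n_instances == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            total += hits / k  # P_k(q) * r_k(q), with P_k(q) = hits / k
    return total / n_instances


def mean_average_precision(relevance_per_query):
    """mAP: the AP averaged over all queries."""
    aps = [average_precision(relevance) for relevance in relevance_per_query]
    return sum(aps) / len(aps)
```

For the ranked relevance list `[1, 0, 1, 0]`, the relevant candidates sit at ranks 1 and 3, so the AP is (1/1 + 2/3) / 2 = 5/6.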

On datasets with bounding boxes, the mAP for word spotting is calculated over instances of individual words. The relevance indicator \(r_{k}(q)\) is then set to 1 only if the retrieved candidate matches the query word and overlaps sufficiently with the ground truth bounding box. Since SHIBRp does not contain bounding boxes, this is not feasible.

To evaluate a word spotting algorithm on semi-annotated datasets like SHIBRp, we instead propose evaluating the mAP with respect to page retrieval (mAPpage). Instances are therefore scanned pages instead of individual words. The set of queries is chosen to be the set of all words occurring in the text. All queries are performed over all the pages, and the results are then ranked according to their single best match for a given query. Finally, relevance \(r_{j}(q)\) is indicated if q occurs on a page at rank j.
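The page-retrieval variant can be sketched as follows. The data structures are assumptions made for illustration: `scores_per_query` maps each query string to a dictionary from page identifier to the similarity scores of all regions on that page, and `page_words` maps each page identifier to the set of transcribed words on that page.

```python
def average_precision(relevance):
    """AP for one query from relevance indicators in ranked order."""
    n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            total += hits / k
    return total / n_relevant


def rank_pages(region_scores_per_page):
    """Rank pages by the score of their single best-matching region."""
    best = {
        page: max(scores, default=float("-inf"))  # pages without regions go last
        for page, scores in region_scores_per_page.items()
    }
    return sorted(best, key=best.get, reverse=True)


def map_page(scores_per_query, page_words):
    """Mean average precision with scanned pages as the retrieved instances."""
    aps = []
    for query, region_scores_per_page in scores_per_query.items():
        ranked = rank_pages(region_scores_per_page)
        # A page is relevant if the query occurs in its transcription.
        relevance = [1 if query in page_words[page] else 0 for page in ranked]
        aps.append(average_precision(relevance))
    return sum(aps) / len(aps)
```

Because relevance only requires the query to occur somewhere in a page's transcription, this evaluation needs page-level transcriptions but no bounding boxes.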

5.2 Results

As a baseline on the SHIBRp test set, we compute the mAPpage with respect to page retrieval (Sect. 5.1) using the Ctrl-F-Mini models trained on the George Washington Dataset [17] and on the IAM Offline Handwriting Dataset [23]. We use the 100 most commonly occurring words in the test set as the set of queries \(Q\). The results, including a random baseline corresponding to randomly ranking all instances for each query, are presented in Table 7.
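The query selection and the random baseline described above can be sketched as below. This is an illustration under assumptions, not the evaluation script used for the reported numbers: `page_words` is assumed to map each page identifier to the list of transcribed words on that page, and the seed and function names are arbitrary.

```python
import random
from collections import Counter
from statistics import mean, stdev


def most_common_words(page_words, n=100):
    """Return the n most frequently occurring words over all pages."""
    counts = Counter(word for words in page_words.values() for word in words)
    return [word for word, _ in counts.most_common(n)]


def average_precision(relevance):
    """AP for one query from relevance indicators in ranked order."""
    n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            total += hits / k
    return total / n_relevant


def random_baseline(page_words, queries, trials=100, seed=0):
    """Mean and standard deviation of the page-level mAP under random ranking."""
    rng = random.Random(seed)
    pages = list(page_words)
    results = []
    for _ in range(trials):
        aps = []
        for query in queries:
            rng.shuffle(pages)  # a fresh random ranking for every query
            relevance = [1 if query in page_words[page] else 0 for page in pages]
            aps.append(average_precision(relevance))
        results.append(mean(aps))
    return mean(results), stdev(results)
```

The random baseline is not close to zero for frequent queries: a word that occurs on most pages yields a high AP under any ranking, which is why comparing against this baseline is more informative than looking at the raw score alone.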

Table 7 mAPpage with respect to page retrieval on the SHIBRp test set described in Sects. 5.1 and 5.2. The result for the random baseline is the mean ± standard deviation over 100 trials

As might be expected, the models trained on the much larger IAM dataset are consistently slightly better than those trained on the George Washington dataset. However, none of the results significantly outperform the random baseline. We would like to highlight three particular challenges that could contribute to the poor results. First, none of the queries, which are Swedish names, are present in the training sets. Previous results have indicated that out-of-vocabulary queries tend to be more difficult for word spotting models [55]. Second, none of the provided models has been exposed to Swedish characters, nor are these characters accounted for in the embeddings. Third, the cursive and connected nature of text lines in our dataset and the variation in writing style hinder achieving robust results using existing pre-trained models [62]. Building generally applicable word spotting applications requires overcoming these challenges. Furthermore, since real-world applications (e.g., when dealing with big data) also perform page retrieval, the mAPpage represents an essential and complementary metric for evaluating word spotting methods.

6 Conclusion

This paper contributes to the research community by providing a large open-access dataset of historical handwritten documents. These documents result from a continuous effort to digitise church birth books available from parishes across Sweden. The dataset that we provide comprises 15,000 high-resolution images in RGB colour space and transcribed files containing a wealth of information spanning the period from 1800 to 1840. The main goal of sharing the SHIBR dataset with the research community is to spark more initiatives to further develop robust document analysis algorithms (e.g., word spotting, document retrieval, character recognition, binarization, layout analysis, etc.) and to promote cross-disciplinary research studies.