SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset

This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’ performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000 high-resolution colour images of the era between 1800 and 1840). We also perform some data mining of the dataset to uncover some statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date of baptism, father's first/last name, mother’s first/last name, death records, town, job title of the father/mother, etc.). Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be used for competitions dedicated to a large set of document analysis problems, including word spotting.


Introduction
Digitising the past is a way to preserve history, restore deteriorating or incomplete text, extract facts and information, and help in searching, document retrieval and data mining tasks. The digitisation of books and documents is among the objectives that current digital libraries and electronic government initiatives place at the top of their priorities. For example, dozens of universities, research centres and companies in Europe have together started a large-scale EU consortium called IMPACT (Improving Access to Text) [1,2]. Among the "Endangered Archives Programme" initiatives of the British Library is the digitisation of manuscripts of the Al-Aqsa Mosque Library, East Jerusalem [3]. This historical collection contains more than a hundred Arabic language titles that span several Islamic periods, from the ninth century CE to the end of the Ottoman rule in Palestine at the beginning of the twentieth century. These books cover topics such as the Arabic language and literature, logic, math, religion, and Sufism.
An old and still valid way to transcribe historical handwritten documents is to rely on crowdsourcing, the practice of gathering information or input into a task by acquiring the services of a large number of people (a.k.a. the crowd). It is often characterised by small and short-term deals [4]. A recent study shows that crowdsourcing, when combined with contemporary technology, delivers far more complete and validated data than automated processes alone could produce [5]. As such, the automated conversion of a historical document into readable text still poses various challenges. The work herein offers possible assistance in lifting some of these challenges by providing one of the most extensive free-access semi-annotated historical handwritten document datasets.
We conclude this section by noting that this dataset would enrich the availability of historical handwritten document datasets and help develop more accurate algorithms for word spotting, optical character recognition (OCR), document layout analysis and image binarization. It would also serve the research community interested in history and heritage (e.g., genealogists), see Fig. 1. This set of motivations was the ultimate impetus for the creation of the SHIBR dataset (the Swedish Historical Birth Records). This complete-page dataset (SHIBR) complements the previously published numerical handwritten dataset (ARDIS) [6], both of which are generously provided for free by Arkiv Digital AD AB, a Swedish company.

Review of related public datasets
Different public handwritten document image datasets have been created and presented to resolve various document image challenges such as text line segmentation [7], word spotting [8], writer identification [9], digit and character segmentation and recognition [10][11][12], binarization [13], and a variety of other challenges [14][15][16]. These datasets enable researchers to develop automated and computationally efficient algorithms. Generally, the existing datasets are classified into two groups based on the era in which they were written: historical or modern datasets. The well-known and widely used document databases are listed in Table 1, several of which are described in this section.
George Washington database (GW) [17,18]: This database is a baseline database for text line segmentation, word spotting and word recognition tasks. It consists of 20 historical handwritten document images written in English in longhand script with an ink-type pen in the eighteenth century. These images are annotated with 656 text lines, 4894 word instances, 1471 word classes and 82 letters.
IAM database [22,23]: The IAM database contains 1539 handwritten modern document images written by 657 different English scriptwriters. The documents were scanned at a resolution of 300dpi and stored in greyscale colour to create this database. The document images are labelled using an automatic segmentation approach and are verified visually. The database consists of 5685 isolated and labelled sentences, 13,353 isolated and labelled text lines, 115,320 isolated and labelled words.
VML-HD database [24]: This database includes 668 handwritten document images. The documents in the database were written in Arabic by different writers between the years 1088 and 1451. All the words and characters in the document images are manually annotated with bounding boxes. As a result, this database consists of 159,149 annotated words and 326,289 annotated characters.
HADARA80P database [25]: This database contains 80 handwritten document images written in the Arabic language. This database is used for word segmentation problems, as the words in the document images are annotated with polygons.
Esposalles database [27,28]: The Esposalles database is a Spanish historical handwriting document image database consisting of 173 document images. The documents were written between 1451 and 1905, and they contain information from the marriage licenses of Spanish citizens. These documents are collected from different books available at the archives of the Cathedral of Barcelona. All text blocks, lines, and transcriptions in the document images are manually labelled. Furthermore, this database has been used to develop handwriting recognition algorithms.
Other handwritten document image databases have also been created and can be found with more details in [19,21,26,29]. Furthermore, the proposed SHIBR document dataset is comprehensively scrutinised and described in Sect. 4.

Limitations of existing document images databases
Even though some of the existing datasets have annotations, most of them have several limitations: (1) a small number of document images; (2) a lack of Swedish characters; (3) a lack of historical documents written in Swedish handwriting styles with various types of dip pens; and (4) a lack of significant variations of artefacts (e.g., degradation, bleed-through, ink leakage, etc.). For instance, the George Washington dataset is the smallest, with only 20 English document images. In contrast, the IFN/ENIT dataset, the most extensive document image database (see Table 1), consists of 6735 binary Arabic document images. Consequently, the main challenges when using these datasets for historical document image analysis are that they (1) offer a small number of document images, which leads to small intra- and inter-class variations, (2) exhibit a small number of artefacts, and (3) cover historical handwriting styles insufficiently. To support research in historical document image analysis, it is therefore essential to construct a new dataset that addresses the shortcomings of the existing ones. Thus, this paper proposes a new and large dataset (SHIBR) containing 15,000 document images. The SHIBR dataset is semi-annotated, easing the development of automated and semi-automated machine learning methods for document analysis applications. To the best of our knowledge, it is the largest historical handwritten document dataset and the first semi-annotated historical document image dataset with Swedish characters.

Challenges and opportunities in historical handwritten documents
In the context of handwritten document image analysis, many challenges need to be tackled [30]. Most of the state-of-the-art solutions focus mainly on word spotting and recognition challenges.

Word/pattern spotting
Over the past decade, an enormous collection of handwritten and machine-printed documents has been digitised to preserve the information they contain. Word spotting methods are used to extract relevant information from these documents. Generally, word spotting in handwritten document images is much more complex than in machine-printed document images, because the former exhibit significant variations of handwriting styles and various character types in different languages. As the text in these historical documents was written by different writers, it varies widely in appearance because of skewness, curvature, aspect ratio and size on the one hand, and because of broken and connected words/characters on the other. As a result, these variations in writing styles across languages create an almost endless diversity for word spotting in handwritten document images. Many word spotting methods have been proposed for document indexation. They can be classified into two groups: (1) segmentation-based and (2) segmentation-free methods.
Matching is a segmentation-free word spotting approach. It is the process of searching for a target or template word image in document images, and it is one of the basic approaches that have been applied for word spotting. It mainly relies on a similarity or distance measure between the template word image cropped from a document image and the target region of the document images. In [31], a word-level matching scheme is proposed to search for a template word image in printed document images using a feature-extraction technique; the extracted features are then used for similarity estimation. In [32], another word spotting framework is proposed, based on dynamic time warping (DTW) matching. The DTW matching algorithm is applied to machine-printed document images, providing superior results for word spotting. In [33], a block adjacency graph (BAG) method for word spotting is designed and employed based on similarity estimation between the template image and the moving window regions in document images. A word shape coding scheme is proposed by Bai et al. [34] that combines feature descriptors and a matching technique for word spotting in document images. A block-based document image descriptor used for word spotting in historical printed documents based on the template matching process is proposed by Rabaev et al. [35]. Their experiments show that this method provides promising results if the documents do not include too many undesired artefacts. Other matching-based word spotting methods can be found in [36][37][38]. The matching-based word spotting techniques have several drawbacks: (1) they are time-consuming, (2) they cannot overcome the undesired artefacts present in handwritten documents, and (3) they often yield poor accuracy rates on handwritten text images.
Thus, learning-based segmentation-free word spotting techniques are designed and applied to increase word spotting accuracy. For instance, Hidden Markov Models (HMMs) have been used for word spotting in handwritten documents [39,40]. Besides, hybrid models have been developed that combine HMMs with a support vector machine (SVM) [41], a neural network (NN) [42] or a deep convolutional neural network (CNN) [43]. In another work, a word spotting system for handwritten Urdu document images is proposed [44]. The method uses several pre-processing steps such as binarization, connected component analysis and edge detection. Subsequently, a sliding window based on an SVM classifier is used to spot Urdu words. In [45], a word spotting and recognition approach based on a common representation of word images and text strings is proposed. The method first extracts the standard features to decrease the dimensional space, and then the nearest neighbour algorithm is used for word spotting. Frinken et al. [46] have designed a method based on a recurrent neural network (RNN) for word spotting in handwritten documents. In [47], an efficient patch-based framework combined with scale-invariant feature transform (SIFT) descriptors is proposed for keyword spotting in historical document collections. In [48], a CNN architecture is designed for word spotting in handwritten documents. Extensive surveys of word spotting methods can be found in [49][50][51][52][53].
SHIBR lends itself well as a challenging benchmark for word spotting methods. Since the SHIBR dataset, like many other historical documents, does not have segmented words, we exclusively look at the segmentation-free methods. These methods may, for example, rely on traditional computer vision to find interesting regions or on region-proposal networks like Faster-RCNN [54]. The Ctrl-F-Mini algorithm fulfils this criterion and has been shown to outperform many existing methods [55]. We, therefore, select Ctrl-F-Mini for the first benchmarks on SHIBR. While its dependence on bounding boxes during training makes it difficult to train on SHIBR, its pre-trained models may be readily evaluated on the dataset. Ctrl-F-Mini is a deep convolutional network model for segmentation-free word spotting. It is related to the Faster-RCNN alternative but uses Dilated Text Proposals (DTP) [55] instead of a Region Proposal Network (RPN) [54] to propose regions of a manuscript page potentially containing words. For each region, the network outputs the estimated probability of it depicting a word and a word embedding. The word embeddings are either the Pyramidal Histogram of Characters (PHOC) or the Discrete Cosine Transform of Words (DCToW) [55] embeddings. In word retrieval, the regions are ranked by their embeddings' cosine similarity to the query string embedding.
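The retrieval step described above, ranking proposed regions by the cosine similarity between their embeddings and the query string embedding, can be sketched as follows. This is a minimal illustration and not the Ctrl-F-Mini implementation: the function names and the three-dimensional toy embeddings are assumptions made for the example, whereas real PHOC or DCToW embeddings have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_regions(query_embedding, region_embeddings):
    """Return region indices ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(region_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy embeddings, invented purely for illustration.
query = [1.0, 0.0, 1.0]
regions = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [0.5, 0.5, 0.5]]
ranking = rank_regions(query, regions)
print(ranking)
```

In this toy case the second region is nearly parallel to the query and is ranked first, while the first region is orthogonal to it and is ranked last.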

Opportunities: genealogy research
Genealogy and the study of family history (family trees) are closely interlinked. In the modern era, and with the success and spread of high-throughput DNA sequencing, genealogy has become a vibrant field. However, genealogists are also interested in unravelling the history of families by mining historical documents. Nevertheless, there exist contentious points that Hatton [56] has rethought in a study examining history, lineage, identity, and technology in relation to genealogy. As depicted in Fig. 2, the top twenty countries most interested in genealogy show diversity, with Sweden coming slightly above the USA. The statistics are retrieved from the Google Trends tool (a website by Google that analyses the popularity of top search queries in Google Search across various regions and languages).

Opportunities: window into the past
A window into the past may allow the investigation of assumptions about the historical position and the actual practices and thinking of a particular epoch, presenting genealogists with opportunities to further their theories. For instance, Abildgren explores what a crowdsourced genealogical online database can say about Denmark's income inequality during the First World War [57]. Additionally, Zhu constructs the concept system and relationships of the genealogy ontology and takes Wu Shilai, the ancestor of the 23rd generation of Wu's, as an example to realise the visualisation of traditional Chinese genealogy [58].

SHIBR dataset
This dataset is retrieved from the Arkiv Digital AD AB image and index database. When a child was born in Sweden in the 1800s, he or she was registered by a priest in a church record book called Birth and Christening Records. These priests registered the child's name, when the child was born and baptised, where the child lived, and information about the father and the child's mother, as shown in Table 2. The transcription is based on manual annotation (at Arkiv Digital AD AB and its partners) of scanned images from 1800 to 1840.

Structure of SHIBR
The master dataset (SHIBR_m) consists of 818,110 indexed rows and 64,084 images. This dataset is confidential and can only be used according to the agreement between the company Arkiv Digital AD AB and the Blekinge Tekniska Högskola (BTH). However, a subset comprising random samples from the period 1800-1840 was determined to abide by the GDPR (the European General Data Protection Regulation) law and, therefore, can be made open access. As such, the public dataset (SHIBR_p) with semi-annotation consists of 10,500, 2250 and 2250 images for training, testing and validation, respectively. Hence, in total, the SHIBR_p public dataset consists of 15,000 high-resolution (2000 × 1300 to 6000 × 4000) images in RGB (Red, Green and Blue) colour space (approximately 50 GB of data) that exhibit a variety of layouts, handwriting styles, background colours and degradations. Additionally, SHIBR_p is associated with Excel spreadsheets for each of the three folders (training, testing and validation). In total, the spreadsheets contain 191,301 entries.
Description of the index columns: Each image (of the 15,000) corresponds to a double page of a book, and each of these images is associated with an entry in the annotation file (manual transcription) with 17 columns, as shown in Table 2. The structure of the transcribed file, along with a sample image, is exemplified in Fig. 4.

Mining SHIBR_m: statistical insights
Data mining methods aim at discovering frequently occurring patterns in a source dataset [59]. Here, we deploy simple basic statistics to identify potentially helpful information associated with the document images in SHIBR_m pertaining to the nineteenth century. The findings listed here are merely examples of the uncharted side of the SHIBR_m dataset and of what its public version, SHIBR_p, can offer to the research community, especially to genealogists. Please note that the statistics drawn herein are only reflections of the set of data we currently have (i.e., SHIBR_m). In no way should they be taken as a de facto reference to the total population's overall reality during that era. We only analyse and describe the data that we have.
• Birth rate stratified by county/year: By examining the birth rate, we can see that it has an overall increasing trend and is seemingly aligned with the public statistics. When the retrieved data is stratified by county, we see, at the macro-level, that Älvsborgs län exhibits a large birth rate, as shown in Fig. 5. This could, for example, be linked to socioeconomic status.
• Rate of stillborn (dödfödd): Analogous to the birth rate, and probably driven by it (dependent variables), is the death rate, which also exhibits an increasing trend at the macro-level. Table 3 shows that Gotlands län tops the list with 1.857% of total newborns, and at the bottom of the list is Norrbottens län with 0.749%. Worth noting is that baby boys consistently exhibit higher death rates than baby girls in the data we have spanning the period 1800 to 1840. We are uncertain whether this difference is genuine and descriptive of the total population in that period. However, this finding is consistent with a recent report published by Statistics Sweden SCB (a Swedish agency) stating that "infant mortality has been higher for boys than for girls but this difference between the sexes is almost non-existent in the twenty-first century" and with the Historical Statistics of Sweden [60]. Figure 6 shows the overall mortality rate stratified by gender, and Table 3 tabulates the rate for each county.
• Period until baptised: Baptism is a Christian rite for acceptance and adoption into Christianity. It was socially unacceptable not to baptise a child during that era. SHIBR_m stores the number of days from birth to baptism. Figure 7 shows Älvsborgs län as the county with the shortest time interval between birth and baptism. Norrbottens län and Västerbottens län top the list, probably because they are located in northern Sweden, where a large part of the population was made up of the Sámi people (Laplanders in English), who likely had to travel long distances to churches for baptism.
• Most common first names (babies, women, men): Another typical trend to look at is the popularity of the first names given to babies and parents. Some of these names are fading away in modern Sweden, making room for more trendy ones, especially among the young generation. Table 4 shows the top 10 most common names among newborns, fathers and mothers.
• Age of women in the birth records: What is the average age of mothers in the birth records of SHIBR_m during the period 1800-1840? Fig. 8 depicts a bar chart of the age distribution of women in all counties. More in-depth statistics stratified by county are tabulated in Table 5.
• Most typical job titles (women/men): A final factor we look at in the SHIBR_m data is the job title of men and women during that period. As can be seen from Table 6, most of the men were farmers, military officers, government employees (tax officers), etc. Women during that period were not empowered to have their own jobs, and thus we see a lack of job titles in their column. Maids and housekeepers were the only reported jobs we found in the large dataset SHIBR_m.
Fig. 4 Example of a scanned book page and the view of a file containing transcribed data (17 columns). See Table 2.
In those days, written Swedish was not standardised as it is today, and most people could not read or write; only certain groups of people, for example priests, could. It was not until 1842 that a law made primary school compulsory for every child in Sweden. As such, the spelling of words and names could differ across Sweden and between individuals. This results in heterogeneity in both spelling and writing, and this heterogeneity of historical texts adds complexity to robust image processing and recognition tasks.
However, after the first-round manual transcription of the SHIBR dataset by the company's (Arkiv Digital AB) partners, the company carried out a validity check by Swedish native speakers to improve the transcription quality.
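As a rough illustration of the kind of stratified statistic reported in this section (for example, the stillborn rate per county in Table 3), the sketch below computes a percentage per group from transcription rows. The field names `county` and `stillborn`, the county labels and the four toy rows are assumptions invented for the example; they are not the actual column names or values of the SHIBR spreadsheets.

```python
from collections import defaultdict


def stillborn_rate_by_county(rows):
    """Return {county: percentage of records flagged as stillborn}."""
    births = defaultdict(int)
    stillborn = defaultdict(int)
    for row in rows:
        births[row["county"]] += 1
        if row["stillborn"]:
            stillborn[row["county"]] += 1
    return {county: 100.0 * stillborn[county] / births[county] for county in births}


# Hypothetical rows, not taken from the dataset.
rows = [
    {"county": "County A", "stillborn": False},
    {"county": "County A", "stillborn": True},
    {"county": "County B", "stillborn": False},
    {"county": "County B", "stillborn": False},
]
rates = stillborn_rate_by_county(rows)
print(rates)
```

The same grouping pattern would apply to the other stratified figures (birth counts per year, days until baptism per county) by swapping the grouping key and the aggregated quantity.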

Experiments and results
In this section, we present experimental results of the performance of the chosen algorithm, Ctrl-F-Mini [55] (discussed in Sect. 3.1), on page retrieval in SHIBR. In all cases, we use the trained models and implementation made available by the original authors. These models were trained with the Adam optimiser using a learning rate of 0.001, multiplied by 0.1 every 10,000 steps. After a total of 25,000 steps, the model with the highest hold-out-set score was used. While Ctrl-F-Mini comes with models trained with different loss functions, we select the ones trained with the cosine loss as they have the best reported performance.
The SHIBR_p scanned pages are converted to greyscale and pre-processed using the model's provided settings for the George Washington dataset. We use an NMS (Non-Maximum Suppression) threshold of 0.4 for the DTP regions in all experiments, as this corresponds to the negative-match threshold in Ctrl-F-Mini [55]. We do not use any additional word-likeness filtering, and we do not limit the number of regions per document before predicting the embeddings.
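To make the role of the NMS threshold concrete, the following is a minimal sketch of standard greedy non-maximum suppression over axis-aligned boxes with an IoU threshold of 0.4. It illustrates the general technique only; the exact suppression procedure used for DTP regions inside Ctrl-F-Mini may differ, and the boxes and scores below are invented.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.4):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box above the threshold.
        if all(iou(boxes[i], boxes[k]) <= threshold for k in kept):
            kept.append(i)
    return kept


# Toy region proposals, invented for illustration.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, whereas the third box does not overlap either and is kept.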

Segmentation-free evaluation of word spotting
The most used metric for evaluating word spotting is the Mean Average Precision (mAP) [61]. For a given query q, the precision at k is:

$$P_k(q) = \frac{1}{k} \sum_{j=1}^{k} r_j(q),$$

where the relevance indicator r_j(q) is 1 if the candidate with rank j matches q and 0 otherwise. Averaging this value over the ranks k at which a relevant candidate is retrieved gives the Average Precision (AP):

$$\mathrm{AP}(q) = \frac{\sum_{k=1}^{n} P_k(q)\, r_k(q)}{\sum_{k=1}^{n} r_k(q)},$$

where n is the number of ranked candidates. To finally obtain the mAP, the AP is averaged over the set of all queries Q for the task:

$$\mathrm{mAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q).$$

To evaluate word spotting on datasets with bounding boxes, the mAP is calculated with instances of individual words. The relevance indicator r_k(q) is then determined by requiring both a word match and a particular overlap between the retrieved candidate and the ground truth bounding box. Since SHIBR_p does not contain bounding boxes, this is not feasible.
To evaluate a word spotting algorithm on semi-annotated datasets like SHIBR_p, we instead propose evaluating the mAP with respect to page retrieval (mAP_page). Instances are therefore scanned pages instead of individual words. The set of queries is chosen to be the set of all words occurring in the text. All queries are performed over all the pages, and the pages are then ranked according to their single best match for a given query. Finally, the relevance r_j(q) is 1 if q occurs on the page at rank j and 0 otherwise.
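The page-level evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the evaluation code used for the experiments: the data layout (`page_scores[q][p]` as a list of region similarity scores for query q on page p, and `page_words[p]` as the set of transcribed words of page p), the function names and the toy values are all hypothetical.

```python
def average_precision(relevance):
    """AP of a ranked list of 0/1 relevance flags (0.0 if nothing is relevant)."""
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision at rank k, counted only at relevant ranks
    return precision_sum / hits if hits else 0.0


def map_page(queries, page_scores, page_words):
    """Mean AP where instances are pages ranked by their single best region match."""
    aps = []
    for q in queries:
        best = {p: max(scores) for p, scores in page_scores[q].items() if scores}
        ranked = sorted(best, key=best.get, reverse=True)
        relevance = [1 if q in page_words[p] else 0 for p in ranked]
        aps.append(average_precision(relevance))
    return sum(aps) / len(aps)


# Toy example with three pages and two queries (values invented).
page_words = {"p1": {"anna"}, "p2": {"olof"}, "p3": {"anna", "olof"}}
page_scores = {
    "anna": {"p1": [0.9, 0.2], "p2": [0.8], "p3": [0.5, 0.1]},
    "olof": {"p1": [0.1], "p2": [0.7], "p3": [0.6]},
}
score = map_page(["anna", "olof"], page_scores, page_words)
print(round(score, 4))
```

For the query "anna" the ranked relevance list is [1, 0, 1], giving an AP of (1 + 2/3) / 2, and for "olof" it is [1, 1, 0], giving an AP of 1, so the toy mAP_page is 11/12.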

Results
As a baseline on the SHIBR_p test set, we compute the mAP_page with respect to page retrieval (Sect. 5.1) using the Ctrl-F-Mini models trained on the George Washington Dataset [17] and on the IAM Offline Handwriting Dataset [23]. We use the 100 most commonly occurring words in the test set as the set of queries Q. The results, including a random baseline corresponding to randomly ranking all instances for each query, are presented in Table 7.
As might be expected, the models trained on the much larger IAM dataset are consistently slightly better than those trained on the George Washington dataset. However, none of the results significantly outperforms the random baseline. We would like to highlight three particular challenges that could contribute to the poor results. First, we acknowledge that none of the queries, which are Swedish names, are present in the training sets. Previous results have indicated that out-of-vocabulary queries tend to be more difficult for word spotting models [55]. Second, none of the provided models has been exposed to Swedish characters, nor are these characters accounted for in the embeddings. Third, the cursive and connected nature of text lines in our dataset and the variation in the writing style hinder achieving robust results using existing pre-trained models [62]. Building generally applicable word spotting applications requires overcoming these challenges. Furthermore, since real-world applications (e.g., when dealing with big data) also perform page retrieval, the mAP_page represents an essential and complementary metric for evaluating word spotting methods.
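The second challenge can be illustrated with a simplified PHOC-style embedding. The sketch below is an assumption-laden toy, not the embedding code of Ctrl-F-Mini: it uses only unigram levels 1 to 3 and the common rule that a character is assigned to a region when at least half of the character's span falls inside it. It shows that a character outside the embedding alphabet, such as "å" with an English-only alphabet, leaves no trace in the query embedding.

```python
import string


def phoc(word, alphabet, levels=(1, 2, 3)):
    """Simplified PHOC: one binary histogram over `alphabet` per region per level."""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                pos = alphabet.find(ch)
                if pos < 0:
                    continue  # character cannot be represented in this alphabet
                # Work in units of 1/(n*level) so that all arithmetic is exact:
                # character i spans [i*level, (i+1)*level], region spans [region*n, (region+1)*n].
                overlap = min((i + 1) * level, (region + 1) * n) - max(i * level, region * n)
                if 2 * overlap >= level:  # at least half of the character lies in the region
                    bits[pos] = 1
            vector.extend(bits)
    return vector


english = string.ascii_lowercase
swedish = english + "åäö"  # hypothetical extended alphabet

print(sum(phoc("å", english)))  # no bit is set for an out-of-alphabet character
print(sum(phoc("å", swedish)) > 0)
```

With an English-only alphabet, a Swedish name is therefore embedded as if its "å", "ä" or "ö" characters were absent, which is one plausible reason why such queries are hard for the pre-trained models.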

Conclusion
This paper contributes to the research community by providing an open-access large dataset of historical handwritten documents. These documents result from a continuous effort to digitise church birth books available from parishes across Sweden. The dataset that we provide comprises 15,000 high-resolution images in RGB colour space and transcribed files containing a wealth of information spanning the period from 1800 to 1840. The main goal of sharing the SHIBR dataset with the research community is to spark more initiatives to further develop robust document analysis algorithms (e.g., word spotting, document retrieval, character recognition, binarization, layout analysis, etc.) and to promote cross-disciplinary research studies.
[...] the SHIBR dataset and for allowing us to make it open access. Finally, we acknowledge the editorial committee's support and the insightful comments and suggestions of the anonymous reviewers.
Funding Open access funding provided by Blekinge Institute of Technology.
Availability of data and material The SHIBR data set will be available publicly as open access at the following permanent link upon acceptance: https://ardisdataset.github.io/SHIBR/

Declaration
Conflict of interest Johan Hall is an employee at the company Arkiv Digital AD AB (Sweden), the provider of this dataset. Agrin Hilmkil is an employee at the company Peltarion AB (Sweden). The rest of the authors declare that they have no conflict of interest.

Ethical Approval
The SHIBR dataset is a subset comprising random samples from the period 1800-1840, which were selected to abide by the GDPR (the European General Data Protection Regulation) law and thus are made open access.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.