SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset

Abstract

This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’ performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000 high-resolution colour images of the era between 1800 and 1840). We also perform some data mining of the dataset to uncover some statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date of baptism, father's first/last name, mother’s first/last name, death records, town, job title of the father/mother, etc.). Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be used for competitions dedicated to a large set of document analysis problems, including word spotting.

Introduction

Digitising the past is a way to preserve history, restore deteriorating or incomplete text, extract facts and information, and support searching, document retrieval and data mining tasks. The digitisation of books and documents is among the objectives that current digital libraries and electronic government initiatives place at the top of their priorities. For example, dozens of universities, research centres and companies in Europe have together started a large-scale EU consortium called IMPACT (Improving Access to Text) [1, 2] (see Note 1). Among the “Endangered Archives Programme” initiatives of the British Library is the digitisation of manuscripts of the Al-Aqsa Mosque Library, East Jerusalem [3]. This historical collection contains more than a hundred Arabic-language titles that span several Islamic periods, from the ninth century CE to the end of Ottoman rule in Palestine at the beginning of the twentieth century. These books cover the Arabic language and literature, logic, mathematics, religion, and Sufism (see Note 2).

A long-standing and still valid way to transcribe historical handwritten documents is crowdsourcing: the practice of gathering information or input for a task by enlisting the services of a large number of people (the crowd). It is often characterised by small, short-term deals [4]. In a recent study, crowdsourcing combined with contemporary technology was shown to deliver far more complete and validated data than automated processes alone could produce [5]. This indicates that automatically converting a historical document into readable text still poses various challenges. The work herein offers possible assistance in addressing some of these challenges by providing one of the most extensive free-access semi-annotated historical handwritten document datasets.

We conclude this section by noting that this dataset enriches the pool of available historical handwritten document datasets and can help develop more accurate algorithms for word spotting, optical character recognition (OCR), document layout analysis and image binarization. It can also serve the research community interested in history and heritage (e.g., genealogists); see Fig. 1. These motives were the ultimate impetus for the creation of the SHIBR dataset (the Swedish Historical Birth Records; see Note 3). This complete-page dataset complements the previously published handwritten digit dataset (ARDIS) [6]; both are generously provided for free by Arkiv Digital AD AB, a Swedish company.

Fig. 1 SHIBR and what it serves: per discipline (a) and per community (b)

Review of related public datasets

Different public handwritten document image datasets have been created and presented to resolve various document image challenges such as text line segmentation [7], word spotting [8], writer identification [9], digit and character segmentation and recognition [10,11,12], binarization [13], and a variety of other challenges [14,15,16]. These datasets enable researchers to develop automated and computationally efficient algorithms. Generally, the existing datasets are classified into two groups based on the era in which they were written: historical or modern datasets. The well-known and widely used document databases are listed in Table 1, several of which are described in this section.

Table 1 Annotated handwritten image datasets in different languages that are publicly available

George Washington database (GW) [17, 18]: This database is a baseline database for text line segmentation, word spotting and word recognition tasks. It consists of 20 historical handwritten document images written in English in longhand script with an ink pen in the eighteenth century. These images are annotated with 656 text lines, 4894 word instances, 1471 word classes and 82 letters.

IAM database [22, 23]: The IAM database contains 1539 modern handwritten document images in English written by 657 different writers. The documents were scanned at a resolution of 300 dpi and stored in greyscale. The document images were labelled using an automatic segmentation approach and verified visually. The database consists of 5685 isolated and labelled sentences, 13,353 isolated and labelled text lines, and 115,320 isolated and labelled words.

VML-HD database [24]: This database includes 668 handwritten document images. The documents in the database were written in Arabic by different writers between the years 1088 and 1451. All the words and characters in the document images are manually annotated with bounding boxes. As a result, this database consists of 159,149 annotated words and 326,289 annotated characters.

HADARA80P database [25]: This database contains 80 handwritten document images written in the Arabic language. This database is used for word segmentation problems, as the words in the document images are annotated with polygons.

Esposalles database [27, 28]: The Esposalles database is a Spanish historical handwriting document image database consisting of 173 document images. The documents were written between 1451 and 1905, and they contain information from the marriage licenses of Spanish citizens. These documents are collected from different books available at the archives of the Cathedral of Barcelona. All text blocks, lines, and transcriptions in the document images are manually labelled. Furthermore, this database has been used to develop handwriting recognition algorithms.

Other handwritten document image databases have also been created and can be found with more details in [19, 21, 26, 29]. Furthermore, the proposed SHIBR document dataset is comprehensively scrutinised and described in Sect. 4.

Limitations of existing document images databases

Even though some of the existing datasets have annotations, most of them have several limitations: (1) a small number of document images; (2) a lack of datasets with Swedish characters; (3) a lack of historical documents written in Swedish handwriting styles with various types of dip pens; and (4) a lack of datasets with significant variations of artefacts (e.g., degradation, bleed-through, ink leakage). For instance, the George Washington dataset is the smallest, with only 20 English document images. In contrast, the IFN/ENIT dataset, the most extensive document image database (see Table 1), consists of 6735 binary Arabic document images. Consequently, the main challenges when using these datasets for historical document image analysis are that they (1) contain a small number of document images, which leads to small intra- and inter-class variations, (2) exhibit few artefacts, and (3) cover historical handwriting styles insufficiently. To support research in historical document image analysis, it is therefore essential to construct a new dataset that addresses these shortcomings. This paper proposes a new and large dataset (SHIBR) containing 15,000 document images. The SHIBR dataset is semi-annotated, easing the development of automated and semi-automated machine learning methods for document analysis applications. To the best of our knowledge, it is the largest historical handwritten document dataset and the first semi-annotated historical document image dataset with Swedish characters.

Challenges and opportunities in historical handwritten documents

In the context of handwritten document image analysis, many challenges need to be tackled [30]. Most of the state-of-the-art solutions focus mainly on word spotting and recognition challenges.

Word/pattern spotting

Over the past decade, an enormous collection of handwritten and machine-printed documents has been digitised to preserve the information they contain. Word spotting methods are used to extract relevant information from these documents. Generally, word spotting in handwritten document images is much more complex than in machine-printed document images; the former exhibit significant variations in handwriting style and various character types in different languages. As the text in these historical documents was written by different writers, its appearance varies widely, both in skew, curvature, aspect ratio and size, and in the presence of broken and connected words or characters. As a result, these variations in writing style across languages create a practically endless diversity for word spotting in handwritten document images. Many word spotting methods have been proposed for document indexing. They can be classified into two groups: (1) segmentation-based and (2) segmentation-free methods.

Matching is a segmentation-free word spotting approach: it searches for a target (template) word image in document images. It is one of the basic approaches applied to word spotting and is mainly based on a similarity or distance measure between the template word image cropped from a document image and a target region of the document images. In [31], a word-level matching scheme is proposed to search for a template word image in printed document images using a feature-extraction technique; the extracted features are then used for similarity estimation. In [32], another word spotting framework is proposed, based on dynamic time warping (DTW) matching. The DTW matching algorithm is applied to machine-printed document images, providing superior results for word spotting. In [33], a block adjacency graph (BAG) method for word spotting is designed and employed based on similarity estimation between the template image and the moving window regions in document images. A word shape coding scheme is proposed by Bai et al. [34] that combines feature descriptors and a matching technique for word spotting in document images. A block-based document image descriptor used for word spotting in historical printed documents based on the template matching process is proposed by Rabaev et al. [35]. Their experiments show that this method provides promising results if the documents do not include too many undesired artefacts. Other matching-based word spotting methods can be found in [36,37,38]. Matching-based word spotting techniques have several drawbacks: (1) they are time-consuming, (2) they cannot overcome the undesired artefacts present in handwritten documents, and (3) they often yield poor accuracy on handwritten text images.

Thus, learning-based segmentation-free word spotting techniques have been designed and applied to increase word spotting accuracy. For instance, Hidden Markov Models (HMMs) have been used for word spotting in handwritten documents [39, 40]. In addition, hybrid models have been developed that combine HMMs with a support vector machine (SVM) [41], a neural network (NN) [42] or a deep convolutional neural network (CNN) [43]. In another work, a word spotting system for handwritten Urdu document images is proposed [44]. The method uses several pre-processing steps such as binarization, connected component analysis and edge detection. Subsequently, a sliding window with an SVM classifier is used to spot Urdu words. In [45], a word spotting and recognition approach based on a common representation of word images and text strings is proposed. The method first extracts the standard features to decrease the dimensional space, and then the nearest neighbour algorithm is used for word spotting. Frinken et al. [46] have designed a method based on a recurrent neural network (RNN) for word spotting in handwritten documents. In [47], an efficient patch-based framework combined with scale-invariant feature transform (SIFT) descriptors is proposed for keyword spotting in historical document collections. In [48], a CNN architecture is designed for word spotting in handwritten documents. Extensive surveys of word spotting methods can be found in [49,50,51,52,53].

SHIBR lends itself well as a challenging benchmark for word spotting methods. Since the SHIBR dataset, like many other historical document collections, does not have segmented words, we exclusively look at segmentation-free methods. These methods may, for example, rely on traditional computer vision to find regions of interest or on region-proposal networks as in Faster R-CNN [54]. The Ctrl-F-Mini algorithm is segmentation-free and has been shown to outperform many existing methods [55]. We therefore select Ctrl-F-Mini for the first benchmarks on SHIBR. While its dependence on bounding boxes during training makes it difficult to train on SHIBR, its pre-trained models may be readily evaluated on the dataset. Ctrl-F-Mini is a deep convolutional network model for segmentation-free word spotting. It is related to Faster R-CNN but uses Dilated Text Proposals (DTP) [55] instead of a Region Proposal Network (RPN) [54] to propose regions of a manuscript page potentially containing words. For each region, the network outputs the estimated probability of it depicting a word and a word embedding. The word embeddings are either the Pyramidal Histogram of Characters (PHOC) or the Discrete Cosine Transform of Words (DCToW) [55] embeddings. In word retrieval, the regions are ranked by the cosine similarity of their embeddings to the query string embedding.
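The retrieval step described above, ranking candidate regions by the cosine similarity between their embeddings and the query embedding, can be sketched as follows. This is a minimal illustration rather than the Ctrl-F-Mini implementation: the function names and the three-dimensional toy vectors are assumptions made for the example, and real PHOC or DCToW embeddings are much longer.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_regions(query_embedding, region_embeddings):
    """Return (region_index, similarity) pairs, most similar region first."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(region_embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings, for illustration only.
query = [1.0, 0.0, 1.0]
regions = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
ranking = rank_regions(query, regions)
```

In this toy case the second region has the same direction as the query and is ranked first, followed by the third region and then the first, which is orthogonal to the query.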

Opportunities: genealogy research

Genealogy and the study of family history are closely interlinked. In the modern era, with the success and spread of high-throughput DNA sequencing, genealogy has become a vibrant field. However, genealogists are also interested in unravelling the history of families by mining historical documents. There are, nevertheless, contentious points, which Hatton [56] has revisited in a study examining history, lineage, identity, and technology in relation to genealogy. As depicted in Fig. 2, the top twenty countries most interested in genealogy are diverse, with Sweden coming slightly above the USA. The statistics are retrieved from the Google Trends tool (a website by Google that analyses the popularity of top search queries in Google Search across various regions and languages).

Fig. 2 Top twenty countries most interested in genealogy as recorded by the Google Trends tool between 2004 and 2020

Opportunities: window into the past

A window into the past may allow the investigation of assumptions about the historical position and the actual practices and thinking of a particular epoch, presenting genealogists with opportunities to further their theories. For instance, Abildgren explores what a crowdsourced genealogical online database can say about Denmark’s income inequality during the First World War [57]. Additionally, Zhu constructs the concept system and relationships of a genealogy ontology and takes Wu Shilai, the ancestor of the 23rd generation of the Wu family, as an example to realise the visualisation of traditional Chinese genealogy [58].

SHIBR dataset

This dataset is retrieved from the Arkiv Digital AD AB image and index database. When a child was born in Sweden in the 1800s, he or she was registered by a priest in a church record book called Birth and Christening Records. The priests registered the child’s name, when the child was born and baptised, where the child lived, and information about the child’s father and mother, as shown in Table 2. The transcription is based on manual annotation (at Arkiv Digital AD AB and its partners) of scanned images from 1800 to 1840.

Table 2 Columns in SHIBRp’s accompanying transcribed file

Structure of SHIBR

The master dataset (SHIBRm) consists of 818,110 indexed rows and 64,084 images. This dataset is confidential and can only be used according to the agreement between the company Arkiv Digital AD AB and Blekinge Tekniska Högskola (BTH). However, a subset comprising random samples from the period 1800–1840 was determined to comply with the GDPR (the European General Data Protection Regulation) and can therefore be made open access. The public dataset (SHIBRp), which is semi-annotated, consists of 10,500, 2250 and 2250 images for training, testing and validation, respectively. In total, SHIBRp thus comprises 15,000 high-resolution (2000 × 1300 to 6000 × 4000) images in RGB (Red, Green and Blue) colour space (~ 50 GB of data) that exhibit a variety of layouts, handwriting styles, background colours and degradations. Additionally, SHIBRp is accompanied by Excel spreadsheets for each of the three folders (training, testing and validation). In total, the spreadsheets contain 191,301 entries.

Swedish counties (län) covered: The counties recorded in these books are Gotland, Gävleborg, Norrbotten, Västerbotten, Västernorrland, Västmanland, Älvsborg (see Note 4), and Örebro (see Fig. 3).

Fig. 3 Map of Sweden, as before the year 1997 [Source: Familysearch.org], showing some of the Swedish counties reported in the SHIBR dataset

Description of the index columns: Each image (of the 15,000) corresponds to a double page of a book, and each of these images is associated with an entry in the annotation file (manual transcription) with 17 columns, as shown in Table 2. The structure of the transcribed file, along with a sample image, is exemplified in Fig. 4.

Fig. 4 Example of a scanned book page and the view of a file containing transcribed data (17 columns). See Table 2 above for descriptions of columns 1–17

Mining SHIBRm – Statistical insights

Data mining methods aim at discovering frequently occurring patterns in a source dataset [59]. Here, we use basic statistics to identify potentially helpful information associated with the document images in SHIBRm, which pertain to the nineteenth century. The findings listed here are merely examples of the uncharted side of the SHIBRm dataset and of what its public version, SHIBRp, can offer to the research community, especially to genealogists. Please note that the statistics drawn herein reflect only the data we currently have (i.e., SHIBRm). In no way should they be taken as a de facto reference to the overall reality of the total population during that era. We only analyse and describe the data that we have.

  • Birth rate stratified by county/year: By examining the birth rate, we can see that it has an overall increasing trend and is seemingly aligned with the public statistics (see Note 5). When the retrieved data is stratified by county, we see, at the macro-level, that Älvsborgs län exhibits a large birth rate, as shown in Fig. 5. This could, for example, be linked to socioeconomic status.

  • Rate of stillborn (dödfödd): Analogous to the birth rate, and probably driven by it (the two variables are dependent), the death rate also exhibits an increasing trend at the macro-level. Table 3 shows that Gotlands län tops the list with 1.857% of total newborns, and at the bottom of the list is Norrbottens län with 0.749%. It is worth noting that baby boys consistently exhibit higher death rates than baby girls in the data we have for the period 1800 to 1840. We are uncertain whether this difference is genuine and descriptive of the total population in that period. However, this finding is consistent with a recent report published by Statistics Sweden (SCB, a Swedish agency) stating that “infant mortality has been higher for boys than for girls but this difference between the sexes is almost non-existent in the twenty-first century” and with the Historical Statistics of Sweden [60]. Figure 6 shows the overall mortality rate stratified by gender, and Table 3 tabulates the rate for each county.

  • Period until baptised: Baptism is a Christian rite of acceptance and adoption into Christianity. It was socially unacceptable not to baptise a child during that era. SHIBRm stores the number of days from birth to baptism. Figure 7 shows that Älvsborgs län is a county with a short time interval between birth and baptism. Norrbottens län and Västerbottens län top the list, probably because they are located in northern Sweden, where a large part of the population is made up of the Sámi people (Laplanders in English), who may have had to travel long distances to churches for baptism.

  • Most common first names (babies, women, men): Another typical trend to look at is the popularity of the first names given to babies and parents. Some of these names are fading away in modern Sweden, allowing for more trendy ones, especially among the young generation. Table 4 shows the top 10 most common names among newborns, fathers and mothers.

  • Age of women in the birth records: What is the average age of mothers in the birth records of SHIBRm during the period 1800–1840? Fig. 8 depicts a bar chart of the age distribution of women in all counties. More in-depth statistics stratified by counties are tabulated in Table 5.

  • Most typical job titles (women/men): A final factor we look at in SHIBRm data is the job title of men and women during that period. As can be seen from Table 6, most of the men were farmers, military officers, government employees (tax officers), etc. Women during that period were not empowered to have their own jobs, and thus we see a lack of job titles in their column. Maids and housekeepers were the only reported jobs we found in the large dataset SHIBRm.
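The stratified statistics listed above all follow the same pattern: group the transcribed rows by county and aggregate a column. The sketch below shows this for the stillborn rate. The field names `county` and `stillborn` and the toy rows are hypothetical and are not taken from the SHIBR spreadsheets; the actual column names in Table 2 may differ.

```python
from collections import defaultdict


def stillborn_rate_by_county(rows):
    """Percentage of records flagged as stillborn, computed per county."""
    totals = defaultdict(int)
    stillborn = defaultdict(int)
    for row in rows:
        totals[row["county"]] += 1
        if row["stillborn"]:
            stillborn[row["county"]] += 1
    return {
        county: 100.0 * stillborn[county] / totals[county]
        for county in totals
    }


# Made-up toy records for illustration; these are NOT SHIBR data.
rows = [
    {"county": "A", "stillborn": False},
    {"county": "A", "stillborn": True},
    {"county": "A", "stillborn": False},
    {"county": "A", "stillborn": False},
    {"county": "B", "stillborn": False},
    {"county": "B", "stillborn": False},
]
rates = stillborn_rate_by_county(rows)
```

The same grouping could be reused for the other statistics, for example by replacing the percentage with a mean of the days between birth and baptism.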

Fig. 5 Birth rate aggregated from the eight counties (right) and stratified by county (left)

Table 3 Stillborn rate stratified by counties

Fig. 6 Stillborn rate in all counties

Fig. 7 Number of days between birth and baptism, stratified by county

Table 4 Ten most common names in Sweden in the period between 1800 and 1840

Fig. 8 Overall age distribution of women in the birth records (1800–1840)

Table 5 Age statistics of women in the birth records stratified by counties (1800–1840)
Table 6 The most common job titles for men/women during the period (1800–1840)

In those days, the Swedish written language was not standardised as it is today. Most people could not read or write; only certain groups, for example priests, could. It was not until 1842 that a law required every child in Sweden to attend primary school. As a result, the spelling of words and names could differ across Sweden and between individuals. This heterogeneity in both spelling and handwriting makes robust image processing and recognition tasks more complex. However, after the first-round manual transcription of the SHIBR dataset by the partners of the company (Arkiv Digital AD AB), the company carried out a validity check by native Swedish speakers to improve the transcription quality.

Experiments and results

In this section, we present experimental results for the chosen algorithm, Ctrl-F-Mini [55] (discussed in Sect. 3.1), on page retrieval in SHIBR. In all cases, we use the trained models and implementation made available by the original authors. These models were trained with the Adam optimiser using an initial learning rate of 0.001, multiplied by 0.1 every 10,000 steps. After a total of 25,000 steps, the model with the highest hold-out-set score was used. While Ctrl-F-Mini comes with models trained with different loss functions, we select the ones trained with the cosine loss as they have the best benchmark results. We evaluate both the PHOC and the DCToW embedding variants. Finally, since one model is provided per fold, we always select the one corresponding to the first fold.
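The learning-rate schedule mentioned above (0.001, multiplied by 0.1 every 10,000 steps, for 25,000 steps in total) can be written as a step-decay function. The sketch below is only an illustration of that arithmetic; the function name is hypothetical, and the assumption that the first decay takes effect exactly at step 10,000 is ours, not something stated by the original authors.

```python
def step_decay_lr(step, base_lr=0.001, decay_factor=0.1, decay_every=10_000):
    """Learning rate after `step` training steps under a step-decay schedule."""
    return base_lr * decay_factor ** (step // decay_every)


# Learning rate in each of the three phases of a 25,000-step run.
lr_start = step_decay_lr(0)        # steps 0 to 9,999
lr_middle = step_decay_lr(10_000)  # steps 10,000 to 19,999
lr_end = step_decay_lr(24_999)     # steps 20,000 to 24,999
```

Under these assumptions the rate is roughly 1e-3, 1e-4 and 1e-5 in the three phases, so the final 5,000 steps are trained with a learning rate one hundred times smaller than the initial one.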

The SHIBRp scanned pages are converted to greyscale and pre-processed using the model’s provided settings for the George Washington dataset. We use an NMS (Non-Maximum Suppression) threshold of 0.4 for the DTP regions in all experiments, as this corresponds to the negative-match threshold in Ctrl-F-Mini [55]. We do not use any additional word-likeness filtering, and we do not limit the number of regions per document before predicting the embeddings.
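For readers unfamiliar with NMS, the following is a minimal sketch of the standard greedy procedure with an intersection-over-union (IoU) threshold of 0.4. It is not the DTP or Ctrl-F-Mini code: the box format `(x1, y1, x2, y2)`, the per-region `scores` and the toy boxes are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.4):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy regions: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```

Here the second box has an IoU of about 0.68 with the first and is suppressed, while the third box does not overlap either and is kept.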

Segmentation-free evaluation of word spotting

The most widely used metric for evaluating word spotting is the Mean Average Precision (mAP) [61]. For a given query \(q\), the precision at \(k\) is:

$$P_{k}(q) = \frac{\left| \{\text{top } k \text{ candidates for } q\} \cap \{\text{instances of } q\} \right|}{k} = \frac{1}{k} \sum_{j=1}^{k} r_{j}(q) \tag{1}$$

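where the relevance indicator \(r_{j}(q)\) is 1 if the candidate with rank \(j\) matches \(q\) and 0 otherwise. Averaging \(P_{k}(q)\) over the ranks \(k\) at which a relevant candidate is retrieved, with \(n\) denoting the total number of ranked candidates, gives the Average Precision (AP)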

$$\text{AP}(q) = \frac{\sum_{k=1}^{n} P_{k}(q)\, r_{k}(q)}{\left| \{\text{instances of } q\} \right|} \tag{2}$$

To finally retrieve the mAP, the AP is averaged over the set of all queries \(Q\) for the task:

$$\text{mAP}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q) \tag{3}$$
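The three definitions above can be checked with a small worked example. The sketch below computes the AP and mAP directly from binary relevance lists (one entry per rank, best rank first). The function names and the toy relevance lists are assumptions made for illustration.

```python
def average_precision(relevance, num_instances):
    """AP(q): sum of P_k(q) over relevant ranks, divided by the instances of q."""
    if num_instances == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # P_k(q) evaluated at a relevant rank
    return precision_sum / num_instances


def mean_average_precision(per_query):
    """mAP(Q): mean of AP(q) over (relevance_list, num_instances) pairs."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)


# Query A: relevant candidates at ranks 1 and 3, two instances in total.
ap_a = average_precision([1, 0, 1, 0], num_instances=2)  # (1/1 + 2/3) / 2
# Query B: its single instance is retrieved at rank 2.
ap_b = average_precision([0, 1], num_instances=1)        # (1/2) / 1
m_ap = mean_average_precision([([1, 0, 1, 0], 2), ([0, 1], 1)])
```

For these toy rankings, AP is 5/6 for query A and 1/2 for query B, giving an mAP of 2/3.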

The mAP is usually calculated over instances of individual words, which allows word spotting to be evaluated on datasets with bounding boxes. In that setting, the relevance indicator \(r_{k}(q)\) requires both a word match and a sufficient overlap between the retrieved candidate and the ground-truth bounding box. Since SHIBRp does not contain bounding boxes, this is not feasible.

To evaluate a word spotting algorithm on semi-annotated datasets like SHIBRp, we instead propose evaluating the mAP with respect to page retrieval (mAPpage). Instances are therefore scanned pages instead of individual words. The set of queries is chosen to be the set of all words occurring in the text. Each query is run over all pages, and the pages are then ranked according to their single best-matching region for that query. Finally, the relevance \(r_{j}(q)\) is 1 if \(q\) occurs on the page at rank \(j\) and 0 otherwise.
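A minimal sketch of the page-level evaluation described above is given below, assuming that each page comes with a list of region similarity scores for the query and a set of transcribed words. The data structures, function names, page identifiers and toy words are hypothetical and serve only to illustrate the ranking and relevance rules.

```python
def rank_pages(region_scores_by_page):
    """Rank pages by their single best-scoring region, best page first."""
    best = {
        page: max(scores)
        for page, scores in region_scores_by_page.items()
        if scores  # pages without any proposed region are left unranked
    }
    return sorted(best, key=best.get, reverse=True)


def page_average_precision(query, ranked_pages, words_by_page):
    """AP where an instance is a page whose transcription contains the query."""
    num_relevant = sum(1 for words in words_by_page.values() if query in words)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, page in enumerate(ranked_pages, start=1):
        if query in words_by_page[page]:
            hits += 1
            precision_sum += hits / k
    return precision_sum / num_relevant


# Made-up toy scores and transcriptions for three pages.
region_scores = {"p1": [0.2, 0.9], "p2": [0.5], "p3": [0.7, 0.1]}
words_by_page = {"p1": {"anna", "erik"}, "p2": {"anna"}, "p3": {"karin"}}
ranked = rank_pages(region_scores)
ap_page = page_average_precision("anna", ranked, words_by_page)
```

In this example the pages are ranked p1, p3, p2, and the query occurs on p1 and p2, so the page-level AP is (1/1 + 2/3) / 2. Averaging this value over all queries would give the mAPpage.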

Results

As a baseline on the SHIBRp test set, we compute the mAPpage with respect to page retrieval (Sect. 5.1) using the Ctrl-F-Mini models trained on the George Washington Dataset [17] and on the IAM Offline Handwriting Dataset [23]. We use the 100 most commonly occurring words in the test set as the set of queries \(Q\). The results, including a random baseline corresponding to randomly ranking all instances for each query, are presented in Table 7.
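The query selection and the random baseline described above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for Table 7: the function names, the seed and the toy relevance lists are hypothetical, and each relevance list holds one binary entry per page for one query.

```python
import random
import statistics
from collections import Counter


def most_common_words(transcribed_words, n=100):
    """The n most frequent words, used here as the query set."""
    return [word for word, _ in Counter(transcribed_words).most_common(n)]


def average_precision(relevance):
    """AP of a ranked binary relevance list (all relevant pages are in the list)."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / num_relevant


def random_baseline(relevance_per_query, trials=100, seed=0):
    """Mean and standard deviation of the mAP over random page rankings."""
    rng = random.Random(seed)
    trial_maps = []
    for _ in range(trials):
        aps = []
        for relevance in relevance_per_query:
            shuffled = list(relevance)
            rng.shuffle(shuffled)  # a random ranking of the pages
            aps.append(average_precision(shuffled))
        trial_maps.append(sum(aps) / len(aps))
    return statistics.mean(trial_maps), statistics.stdev(trial_maps)


# Toy setup: two queries over ten pages, occurring on three pages and one page.
toy_relevance = [[1] * 3 + [0] * 7, [1] + [0] * 9]
baseline_mean, baseline_std = random_baseline(toy_relevance)
```

A query that occurs on many pages obtains a high AP even under random ranking, so a baseline of this kind is needed to interpret the scores of frequent query words.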

Table 7 mAPpage with respect to page retrieval on the SHIBRp test set described in Sects. 5.1 and 5.2. The result for the random baseline is the mean ± standard deviation over 100 trials

As might be expected, the models trained on the much larger IAM dataset are consistently slightly better than those trained on the George Washington dataset. However, none of the results significantly outperform the random baseline. We would like to highlight three particular challenges that could contribute to the poor results. First, none of the queries, which are Swedish names, are present in the training sets. Previous results have indicated that out-of-vocabulary queries tend to be more difficult for word spotting models [55]. Second, none of the provided models has been exposed to Swedish characters, nor are these characters accounted for in the embeddings. Third, the cursive and connected nature of text lines in our dataset and the variation in writing style hinder achieving robust results with existing pre-trained models [62]. Building generally applicable word spotting applications requires overcoming these challenges. Furthermore, since real-world applications (e.g., when dealing with big data) also perform page retrieval, the mAPpage represents an essential and complementary metric for evaluating word spotting methods.
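The second challenge can be illustrated with a simplified PHOC-style embedding. The sketch below assumes an alphabet restricted to the letters a–z and the digits 0–9, uses only unigrams and three pyramid levels, and assigns a character to a region when at least half of the character's extent falls inside it. The alphabet, the levels and the function name are assumptions made for this example and do not describe the exact embedding configuration of the pre-trained models.

```python
import string

# Assumed embedding alphabet: it contains no Swedish letters such as å, ä or ö.
ALPHABET = string.ascii_lowercase + string.digits
LEVELS = (1, 2, 3)  # kept small for the example


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Simplified unigram PHOC: one bit per (level, region, character)."""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            # Positions are scaled by n * level so that all arithmetic is exact.
            r_start, r_end = region * n, (region + 1) * n
            bits = [0] * len(alphabet)
            for k, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # out-of-alphabet characters leave no trace
                c_start, c_end = k * level, (k + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if 2 * overlap >= c_end - c_start:  # at least 50% inside region
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector


# Two strings that differ only in a Swedish letter become indistinguishable.
same_embedding = phoc("åsa") == phoc("äsa")
differs_from_ascii = phoc("asa") != phoc("åsa")
```

Under these assumptions, strings that differ only in characters outside the alphabet map to the same vector, so a retrieval model has no way of telling such queries apart.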

Conclusion

This paper contributes to the research community by providing a large open-access dataset of historical handwritten documents. These documents result from a continuous effort to digitise church birth books available from parishes across Sweden. The dataset comprises 15,000 high-resolution images in RGB colour space and transcribed files containing a wealth of information spanning the period from 1800 to 1840. The main goal of sharing the SHIBR dataset with the research community is to spark more initiatives to further develop robust document analysis algorithms (e.g., word spotting, document retrieval, character recognition, binarization and layout analysis) and to promote cross-disciplinary research studies.

Availability of data and material

The SHIBR data set will be available publicly as open access at the following permanent link upon acceptance: https://ardisdataset.github.io/SHIBR/

Notes

  1. IMPACT: http://www.impact-project.eu/home/, [Online], accessed on 2020-04-11.

  2. https://eap.bl.uk/collection/EAP521-1, [Online], accessed on 2020-04-11.

  3. The SHIBR data set is open access and available from the following link: https://ardisdataset.github.io/SHIBR/.

  4. Älvsborg county was a county of Sweden until 1997 when it was merged with the counties of Göteborg, Bohus and Skaraborg to form Västra Götaland County. The county consisted of the provinces of Dalsland and the central part of Västergötland with the seat of residence in the city of Vänersborg.

  5. Sveriges folkmängd från 1749 och fram till idag [Sweden’s population from 1749 until today] by the National Central Bureau of Statistics (SCB), 2017-10-27. Available from: https://www.scb.se/.

Acknowledgements

This project is funded by the research project “DocPRESERV: Preserving & Processing Historical Document Images with Artificial Intelligence”, STINT, the Swedish Foundation for International Cooperation in Research and Higher Education (Grant: AF2020-8892) and by the research project “Scalable Resource Efficient Systems for Big Data Analytics”, the Knowledge Foundation (Grant: 20140032) in Sweden. We also acknowledge the support of the Swedish company, Arkiv Digital AD AB, for providing the SHIBR dataset and for allowing us to make it open access. Finally, we acknowledge the editorial committee's support and the insightful comments and suggestions of the anonymous reviewers.

Funding

Open access funding provided by Blekinge Institute of Technology.

Corresponding author

Correspondence to Abbas Cheddad.

Ethics declarations

Conflict of interest

Johan Hall is an employee at the company Arkiv Digital AD AB (Sweden), the provider of this dataset. Agrin Hilmkil is an employee at the company Peltarion AB (Sweden). The rest of the authors declare that they have no conflict of interest.

Ethical Approval

The SHIBR dataset is a subset comprising random samples from the period 1800–1840, which were selected to comply with the European General Data Protection Regulation (GDPR) and are therefore made open access.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Cheddad, A., Kusetogullari, H., Hilmkil, A. et al. SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-06207-z

Keywords

  • Historical data of birth records
  • Handwritten documents
  • Public dataset
  • Word spotting