Abstract
This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’ performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000 high-resolution colour images of the era between 1800 and 1840). We also perform some data mining of the dataset to uncover some statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date of baptism, father's first/last name, mother’s first/last name, death records, town, job title of the father/mother, etc.). Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be used for competitions dedicated to a large set of document analysis problems, including word spotting.
1 Introduction
Digitising the past is a way to preserve history, restore deteriorating or incomplete text, extract facts and information, and support searching, document retrieval and data mining tasks. The digitisation of books and documents is among the objectives that current digital libraries and electronic government initiatives place at the top of their priorities. For example, dozens of universities, research centres and companies in Europe have together started a large-scale EU consortium called IMPACT (Improving Access to Text; see Footnote 1) [1, 2]. Among the “Endangered Archives Programme” initiatives of the British Library is the digitisation of manuscripts of the Al-Aqsa Mosque Library, East Jerusalem [3]. This historical collection contains more than a hundred Arabic language titles that span several Islamic periods, from the ninth century CE to the end of the Ottoman rule in Palestine at the beginning of the twentieth century. These books cover the Arabic language and literature, logic, mathematics, religion, and Sufism (see Footnote 2).
An old and still valid way to transcribe historical handwritten documents is to rely on crowdsourcing, the practice of gathering information or input into a task by acquiring the services of a large number of people (a.k.a. the crowd). It is often characterised by small and short-term deals [4]. A recent study shows that crowdsourcing, when combined with contemporary technology, delivers far more complete and validated data than automated processes alone could produce [5]. In other words, the fully automated conversion of a historical document into readable text still poses various challenges. The work herein offers possible assistance in lifting some of these challenges by providing one of the most extensive free-access semi-annotated historical handwritten document datasets.
We conclude this section by noting that this dataset would enrich the availability of historical handwritten document datasets and help develop more accurate algorithms for word spotting, optical character recognition (OCR), document layout analysis and image binarization. It would also serve the research community interested in history and heritage (i.e., genealogists), see Fig. 1. These motives were the ultimate impetus for the creation of the SHIBR dataset (the Swedish Historical Birth Records; see Footnote 3). This complete-page dataset complements the previously published numerical handwritten dataset (ARDIS) [6]; both are generously provided for free by Arkiv Digital AD AB, a Swedish company.
2 Review of related public datasets
Different public handwritten document image datasets have been created and presented to resolve various document image challenges such as text line segmentation [7], word spotting [8], writer identification [9], digit and character segmentation and recognition [10,11,12], binarization [13], and a variety of other challenges [14,15,16]. These datasets enable researchers to develop automated and computationally efficient algorithms. Generally, the existing datasets are classified into two groups based on the era in which they were written: historical or modern datasets. The well-known and widely used document databases are listed in Table 1, several of which are described in this section.
George Washington database (GW) [17, 18]: This database is a baseline database for text line segmentation, word spotting and word recognition tasks. The Washington database consists of 20 historical handwritten document images written in English with longhand script and ink-type pen in the eighteenth century. Moreover, these images are annotated with 656 text lines, 4894 word instances, 1471 word classes and 82 letters.
IAM database [22, 23]: The IAM database contains 1539 modern handwritten document images of English text written by 657 different writers. The documents were scanned at a resolution of 300 dpi and stored as greyscale images. The document images are labelled using an automatic segmentation approach and are verified visually. The database consists of 5685 isolated and labelled sentences, 13,353 isolated and labelled text lines, and 115,320 isolated and labelled words.
VML-HD database [24]: This database includes 668 handwritten document images. The documents in the database were written in Arabic by different writers between the years 1088 and 1451. All the words and characters in the document images are manually annotated with bounding boxes. As a result, this database consists of 159,149 annotated words and 326,289 annotated characters.
HADARA80P database [25]: This database contains 80 handwritten document images written in the Arabic language. This database is used for word segmentation problems, as the words in the document images are annotated with polygons.
Esposalles database [27, 28]: The Esposalles database is a Spanish historical handwriting document image database consisting of 173 document images. The documents were written between 1451 and 1905, and they contain information from the marriage licenses of Spanish citizens. These documents are collected from different books available at the archives of the Cathedral of Barcelona. All text blocks, lines, and transcriptions in the document images are manually labelled. Furthermore, this database has been used to develop handwriting recognition algorithms.
Other handwritten document image databases have also been created and can be found with more details in [19, 21, 26, 29]. Furthermore, the proposed SHIBR document dataset is comprehensively scrutinised and described in Sect. 4.
2.1 Limitations of existing document images databases
Even though some of the existing datasets have annotations, most of them have several limitations: (1) a small number of document images; (2) lack of datasets with Swedish characters; (3) lack of historical documents written in Swedish handwriting styles with various types of dip pens; and (4) lack of datasets with significant variations of artefacts (e.g., degradation, bleed-through, ink leakage etc.). For instance, the George Washington dataset is the smallest, with only 20 English document images. In contrast, the IFN/ENIT dataset, the most extensive document image database (see Table 1), consists of 6735 binary Arabic document images. Consequently, the main challenges when using these datasets for historical document image analysis are that they (1) contain a small number of document images, which leads to small intra- and inter-class variations, (2) exhibit a small number of artefacts, and (3) cover historical handwriting styles insufficiently. To support the development of research in historical document image analysis, it is essential to construct a new dataset that addresses these shortcomings. Thus, this paper proposes a new and large dataset (SHIBR) containing 15,000 document images. The SHIBR dataset is semi-annotated, easing the development of automated and semi-automated machine learning methods for document analysis applications. To the best of our knowledge, it is the largest historical handwritten document dataset of its kind and the first semi-annotated historical document image dataset with Swedish characters.
3 Challenges and opportunities in historical handwritten documents
In the context of handwritten document image analysis, many challenges need to be tackled [30]. Most of the state-of-the-art solutions focus mainly on word spotting and recognition challenges.
3.1 Word/pattern spotting
Over the past decade, an enormous collection of handwritten and machine-printed documents has been digitised to preserve the information they contain. Word spotting methods are used to extract relevant information from these documents. Generally, word spotting in handwritten document images is much more complex than in machine-printed document images, because handwriting exhibits significant variations in style and in character types across languages. As the text in these historical documents was written by different writers, its appearance varies widely in skew, curvature, aspect ratio and size, and words or characters may be broken or connected. As a result, these variations create almost endless diversity for word spotting in handwritten document images. Many word spotting methods have been proposed for document indexing. They can be classified into two groups: (1) segmentation-based and (2) segmentation-free methods.
Matching is a segmentation-free word spotting approach: a target (template) word image is searched for in document images. It is one of the basic approaches that have been applied for word spotting on document images, and it typically relies on a similarity or distance measure between the template word image cropped from a document image and candidate regions of the document images. In [31], a word-level matching scheme is proposed to search for a template word image in printed document images using a feature-extraction technique; the extracted features are then used for similarity estimation. In [32], a matching technique based on dynamic time warping (DTW) is proposed and applied to machine-printed document images, providing superior results for word spotting. In [33], a block adjacency graph (BAG) method for word spotting is designed based on similarity estimation between the template image and moving-window regions in document images. Bai et al. [34] propose a word shape coding scheme that combines feature descriptors and a matching technique for word spotting in document images. Rabaev et al. [35] propose a block-based document image descriptor for word spotting in historical printed documents based on template matching. Their experiments show that this method provides promising results if the documents do not include too many undesired artefacts. Other matching-based word spotting methods can be found in [36,37,38]. Matching-based word spotting techniques have several drawbacks: (1) they are time-consuming, (2) they cannot overcome undesired artefacts in handwritten documents, and (3) they often yield poor accuracy on handwritten text images.
Thus, learning-based segmentation-free word spotting techniques have been designed to increase word spotting accuracy. For instance, Hidden Markov Models (HMMs) have been used for word spotting in handwritten documents [39, 40]. In addition, hybrid models have been developed that combine HMMs with a support vector machine (SVM) [41], a neural network (NN) [42] or a deep convolutional neural network (CNN) [43]. In another work, a word spotting system for handwritten Urdu document images is proposed [44]. The method uses several pre-processing steps such as binarization, connected component analysis and edge detection; subsequently, a sliding window with an SVM classifier is used to spot Urdu words. In [45], a word spotting and recognition approach based on a common representation of word images and text strings is proposed. The method first extracts the standard features to decrease the dimensional space, and then the nearest neighbour algorithm is used for word spotting. Frinken et al. [46] have designed a method based on a recurrent neural network (RNN) for word spotting in handwritten documents. In [47], an efficient patch-based framework combined with scale-invariant feature transform (SIFT) descriptors is proposed for keyword spotting in historical document collections. In [48], a CNN architecture is designed for word spotting in handwritten documents. Extensive surveys of word spotting methods can be found in [49,50,51,52,53].
SHIBR lends itself well as a challenging benchmark for word spotting methods. Since the SHIBR dataset, like many other historical documents, does not have segmented words, we exclusively look at the segmentation-free methods. These methods may, for example, rely on traditional computer vision to find interesting regions or on region-proposal networks like Faster-RCNN [54]. The Ctrl-F-Mini algorithm fulfils this criterion and has been shown to outperform many existing methods [55]. We, therefore, select Ctrl-F-Mini for the first benchmarks on SHIBR. While its dependence on bounding boxes during training makes it difficult to train on SHIBR, its pre-trained models may be readily evaluated on the dataset. Ctrl-F-Mini is a deep convolutional network model for segmentation-free word spotting. It is related to the Faster-RCNN alternative but uses Dilated Text Proposals (DTP) [55] instead of a Region Proposal Network (RPN) [54] to propose regions of a manuscript page potentially containing words. For each region, the network outputs the estimated probability of it depicting a word and a word embedding. The word embeddings are either the Pyramidal Histogram of Characters (PHOC) or the Discrete Cosine Transform of Words (DCToW) [55] embeddings. In word retrieval, the regions are ranked by their embeddings’ cosine similarity to the query string embedding.
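The retrieval step described above (ranking regions by the cosine similarity between their embeddings and the query string embedding) can be sketched as follows. This is a minimal illustration and not the Ctrl-F-Mini implementation: the simplified `phoc` function (unigrams only, levels 1 to 3), the alphabet (lowercase Latin letters extended with å, ä, ö) and the toy region embeddings are all assumptions made for the example. In a real system, the region embeddings would be predicted by the network from image regions rather than computed from known strings.

```python
import math

# Assumed alphabet for illustration: lowercase Latin letters plus å, ä, ö.
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def phoc(word, levels=(1, 2, 3)):
    """Simplified PHOC: one binary character histogram per region per level."""
    word = word.lower()
    n = len(word)
    vec = []
    for level in levels:
        for region in range(level):
            bits = [0.0] * len(ALPHABET)
            for i, ch in enumerate(word):
                if ch not in ALPHABET:
                    continue
                # Work in units of 1/(n*level) so the overlap test uses exact integers.
                c_lo, c_hi = i * level, (i + 1) * level
                r_lo, r_hi = region * n, (region + 1) * n
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                # A character is assigned to a region if at least half of it lies inside.
                if 2 * overlap >= level:
                    bits[ALPHABET.index(ch)] = 1.0
            vec.extend(bits)
    return vec


def cosine(u, v):
    """Cosine similarity between two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_regions(query, region_embeddings):
    """Return (similarity, region_id) pairs, most similar region first."""
    q = phoc(query)
    scored = [(cosine(q, emb), rid) for rid, emb in region_embeddings.items()]
    return sorted(scored, reverse=True)


# Toy stand-in for network predictions: embeddings computed from known strings.
regions = {"r1": phoc("Anna"), "r2": phoc("Olof"), "r3": phoc("Anders")}
for similarity, region_id in rank_regions("anna", regions):
    print(region_id, round(similarity, 3))
```

Extending the alphabet is one plausible way to account for Swedish characters in such an embedding; whether this suffices in practice would need to be verified experimentally.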
3.2 Opportunities: genealogy research
Genealogy and the study of family history are interlinked. In the modern era, with the success and spread of high-throughput DNA sequencing, genealogy has become a vivid field. However, genealogists are also interested in unravelling the history of families by mining historical documents. Nevertheless, there exist contentious points, which Hatton [56] revisits in a study examining history, lineage, identity, and technology in relation to genealogy. As depicted in Fig. 2, the twenty countries most interested in genealogy are diverse, with Sweden coming slightly above the USA. The statistics are retrieved from the Google Trends tool (a website by Google that analyses the popularity of top search queries in Google Search across various regions and languages).
3.3 Opportunities: window into the past
A window into the past may allow the investigation of assumptions about the historical position and actual practices and thinking of a particular epoch, presenting genealogists with opportunities to further their theories. For instance, Abildgren explores what a crowdsourced genealogical online database can say about Denmark’s income inequality during the First World War [57]. Additionally, Zhu constructs the concept system and relationships of a genealogy ontology and takes Wu Shilai, the ancestor of the 23rd generation of Wu’s, as an example to realise the visualisation of traditional Chinese genealogy [58].
4 SHIBR dataset
This dataset is retrieved from the Arkiv Digital AD AB image and index database. When a child was born in Sweden in the 1800s, he or she was registered by a priest in a church record book called Birth and Christening Records. These priests registered the child’s name, when the child was born and baptised, where the child lived, and information about the father and the child’s mother, as shown in Table 2. The transcription is based on manual annotation (at Arkiv Digital AD AB and its partners) of scanned images from 1800 to 1840.
4.1 Structure of SHIBR
The master dataset (SHIBRm) consists of 818,110 indexed rows and 64,084 images. This dataset is confidential and can only be used according to the agreement between the company Arkiv Digital AD AB and the Blekinge Tekniska Högskola (BTH). However, a subset comprising random samples from the period 1800–1840 was determined to abide by the GDPR (the European General Data Protection Regulation) law and, therefore, can be made open access. As such, the public dataset (SHIBRp) with semi-annotation consists of 10,500, 2250, 2250 images for training, testing and validation, respectively. Hence, in total, the SHIBRp public dataset consists of 15,000 high-resolution (2000 × 1300 to 6000 × 4000) images in RGB (Red, Green and Blue) colour space (~ 50 GB of data) that exhibit a variety of layouts, handwriting styles, background colour and degradations. Additionally, SHIBRp is associated with Excel spreadsheets for each of the three folders (training, testing and validation). In total, the spreadsheets contain 191,301 entries.
Swedish counties (län) covered: The counties recorded in these books are Gotland, Gävleborg, Norrbotten, Västerbotten, Västernorrland, Västmanland, Älvsborg (see Footnote 4), and Örebro (see Fig. 3).
Description of the index columns: Each image (of the 15,000) corresponds to a double page of a book, and each of these images is associated with an entry in the annotation file (manual transcription) with 17 columns, as shown in Table 2. The structure of the transcribed file, along with a sample image, is exemplified in Fig. 4.
Fig. 4 Example of a scanned book page and the view of a file containing transcribed data (17 columns). See Table 2 above for descriptions of columns 1–17
4.2 Mining SHIBRm – Statistical insights
Data mining methods aim at discovering frequently occurring patterns in a source dataset [59]. Here, we use basic statistics to identify potentially helpful information associated with the document images in SHIBRm, which pertain to the nineteenth century. The findings listed here are merely examples of the uncharted side of the SHIBRm dataset and of what its public version, SHIBRp, can offer to the research community, especially to genealogists. Please note that the statistics drawn herein only reflect the data we currently have (i.e., SHIBRm). In no way should they be taken as a de facto reference for the total population during that era.
- Birth rate stratified by county/year: By examining the birth rate, we can see that it has an overall increasing trend and is seemingly aligned with the public statistics (see Footnote 5). When the retrieved data is stratified by county, we see, at the macro-level, that Älvsborgs län exhibits a large birth rate, as shown in Fig. 5. This could, for example, be linked to socioeconomic status.
- Rate of stillborn (dödfödd): Analogous to the birth rate, and probably driven by it (the two are dependent variables), the death rate also exhibits an increasing trend at the macro-level. Table 3 shows that Gotlands län tops the list with 1.857% of total newborns, and at the bottom of the list is Norrbottens län with 0.749%. Worth noting is that baby boys consistently exhibit higher death rates than baby girls in the data we have spanning the period 1800 to 1840. We are uncertain if this difference is genuine and descriptive of the total population in that period. However, this finding is consistent with a recent report published by Statistics Sweden, SCB (a Swedish agency), stating that “infant mortality has been higher for boys than for girls but this difference between the sexes is almost non-existent in the twenty-first century” and with the Historical Statistics of Sweden [60]. Figure 6 shows the overall mortality rate stratified by gender, and Table 3 tabulates the rate for each county.
- Period until baptised: Baptism is a Christian rite for acceptance and adoption into Christianity. It was socially unacceptable not to baptise a child during that era. SHIBRm stores the number of days from birth to baptism. Figure 7 shows that Älvsborgs län has a short time interval between birth and baptism. Norrbottens län and Västerbottens län top the list, probably because they are located in northern Sweden, where a large part of the population is made up of the Sámi people (Laplanders in English), who likely had to travel long distances to churches for baptism.
- Most common first names (babies, women, men): Another typical trend to look at is the popularity of the first names given to babies and parents. Some of these names are fading away in modern Sweden, allowing for more trendy ones, especially among the young generation. Table 4 shows the top 10 most common names among newborns, fathers and mothers.
- Age of women in the birth records: What is the average age of mothers in the birth records of SHIBRm during the period 1800–1840? Figure 8 depicts a bar chart of the age distribution of women in all counties. More in-depth statistics stratified by counties are tabulated in Table 5.
- Most typical job titles (women/men): A final factor we look at in SHIBRm data is the job title of men and women during that period. As can be seen from Table 6, most of the men were farmers, military officers, government employees (tax officers), etc. Women during that period were not empowered to have their own jobs, and thus we see a lack of job titles in their column. Maids and housekeepers were the only reported jobs we found in the large dataset SHIBRm.
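Stratified rates such as those listed above can be computed with a simple group-by. The following is a minimal sketch under stated assumptions: the field names `county` and `stillborn` are hypothetical and do not necessarily match the column names of the SHIBR spreadsheets, and the rows are made-up toy values, not data from SHIBRm.

```python
from collections import defaultdict


def rate_by_group(records, group_key, flag_key):
    """Percentage of records per group for which the flag is set."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[flag_key]:
            hits[group] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Made-up toy rows (hypothetical field names, not SHIBRm data).
toy_records = [
    {"county": "A", "stillborn": True},
    {"county": "A", "stillborn": False},
    {"county": "A", "stillborn": False},
    {"county": "A", "stillborn": False},
    {"county": "B", "stillborn": False},
    {"county": "B", "stillborn": False},
]
print(rate_by_group(toy_records, "county", "stillborn"))
```

The same helper could be reused for other stratifications (for example, by year or by sex) by changing `group_key`.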
In that era, the Swedish written language was not standardised as it is today. Most people could not read and write; only certain groups, for example priests, could. A law requiring every child in Sweden to attend primary school was only introduced in 1842. As such, the spelling of words and names could differ across Sweden and between individuals, which results in heterogeneity in both spelling and writing. This heterogeneity of historical texts makes robust image processing and recognition tasks more complex. However, after the first-round manual transcription of the SHIBR dataset by the partners of the company (Arkiv Digital AD AB), the company carried out a validity check by Swedish native speakers to improve the transcription quality.
5 Experiments and results
In this section, we present experimental results on the performance of the chosen algorithm, Ctrl-F-Mini [55] (discussed in Sect. 3.1), on page retrieval in SHIBR. In all cases, we use the trained models and implementation made available by the original authors. These models were trained with the Adam optimiser with a learning rate of 0.001, multiplied by 0.1 every 10,000 steps. After a total of 25,000 steps, the model with the highest hold-out-set score was used. While Ctrl-F-Mini comes with models trained with different loss functions, we select the ones trained with the cosine loss as they have the best benchmark results. We evaluate both the PHOC and the DCToW embedding variants. Finally, since one model is provided per fold, we always select the one corresponding to the first fold.
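The learning rate schedule described above can be illustrated with a short sketch. It assumes a staircase (step) decay, i.e., the rate is multiplied by 0.1 once every 10,000 steps, which is our reading of the description and not taken from the original training code; the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(step, base_lr=1e-3, decay=0.1, every=10_000):
    """Learning rate at a given training step under a staircase decay."""
    return base_lr * decay ** (step // every)


# Learning rate at a few points of a 25,000-step run.
for step in (0, 9_999, 10_000, 20_000, 24_999):
    print(step, step_decay_lr(step))
```

Under this reading, the rate is 0.001 for the first 10,000 steps, 0.0001 for the next 10,000 and 0.00001 for the remaining 5,000 steps (up to floating-point rounding).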
The SHIBRp scanned pages are converted to greyscale colour space and pre-processed using the model’s provided settings for the George Washington dataset. We use an NMS (Non-Maximum Suppression) threshold of 0.4 for the DTP regions for all experiments as this corresponds to the negative-match threshold in Ctrl-F-Mini [55]. We do not use any additional word-likeness filtering, and we do not limit the number of regions per document before predicting the embeddings.
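For readers unfamiliar with non-maximum suppression, the following is a generic greedy NMS sketch with an overlap threshold of 0.4. It is not the Ctrl-F-Mini code: the box format `(x1, y1, x2, y2)`, the use of intersection over union (IoU) as the overlap measure and the per-box scores used for ordering are assumptions made for the illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.4):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box above the threshold.
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


# Toy example: the first two boxes overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In the toy example, the second box has an IoU of 81/119 (about 0.68) with the first and is therefore suppressed at a threshold of 0.4.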
5.1 Segmentation-free evaluation of word spotting
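The most widely used metric for evaluating word spotting is the Mean Average Precision (mAP) [61]. For a given query \(q\), the precision at \(k\) is:
\[P@k(q) = \frac{1}{k}\sum_{j=1}^{k} r_j(q)\]
where the relevance indicator \(r_j(q)\) is 1 if the candidate with rank \(j\) matches \(q\) and 0 otherwise. Averaging this value over the ranks \(k\) that hold a relevant candidate gives the Average Precision (AP), where \(n\) is the number of ranked candidates:
\[AP(q) = \frac{\sum_{k=1}^{n} P@k(q)\, r_k(q)}{\sum_{k=1}^{n} r_k(q)}\]
To finally retrieve the mAP, the AP is averaged over the set of all queries \(Q\) for the task:
\[mAP = \frac{1}{|Q|}\sum_{q \in Q} AP(q)\]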
The mAP is usually calculated over instances of individual words to evaluate word spotting on datasets with bounding boxes. The relevance indicator \(r_k(q)\) is then determined by whether the word matches and whether the retrieved candidate overlaps sufficiently with the ground truth bounding box. Since SHIBRp does not contain bounding boxes, this is not feasible.
To evaluate a word spotting algorithm on semi-annotated datasets like SHIBRp, we instead propose evaluating the mAP with respect to page retrieval (mAPpage). Instances are therefore scanned pages instead of individual words. The set of queries is chosen to be the set of all words occurring in the text. All queries are performed over all the pages, and the pages are then ranked according to their single best match for a given query. Finally, the relevance indicator \(r_j(q)\) is 1 if \(q\) occurs on the page at rank \(j\).
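The page-level evaluation can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code used for the experiments: it assumes the standard AP definition (precision averaged over the ranks that hold a relevant page), and the data structures `page_scores` (best region similarity per query and page) and `page_words` (transcribed words per page), as well as the toy values, are hypothetical.

```python
def average_precision(relevance):
    """AP of a ranked list of 0/1 relevance indicators (best-ranked first)."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at rank k
    return total / hits if hits else 0.0


def map_page(queries, page_scores, page_words):
    """Mean AP over queries, where the ranked instances are whole pages."""
    aps = []
    for q in queries:
        # Rank pages by their single best match for the query.
        ranked = sorted(page_scores[q], key=lambda p: page_scores[q][p], reverse=True)
        relevance = [q in page_words[p] for p in ranked]
        aps.append(average_precision(relevance))
    return sum(aps) / len(aps) if aps else 0.0


# Toy example with three pages and two queries (made-up scores).
page_words = {"p1": {"anna"}, "p2": {"olof"}, "p3": {"anna", "olof"}}
page_scores = {
    "anna": {"p1": 0.9, "p2": 0.5, "p3": 0.7},
    "olof": {"p1": 0.8, "p2": 0.6, "p3": 0.4},
}
print(map_page(["anna", "olof"], page_scores, page_words))
```

A random baseline can be obtained with the same functions by assigning random scores to the pages for each query.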
5.2 Results
As a baseline on the SHIBRp test set, we compute the mAPpage with respect to page retrieval (Sect. 5.1) using the Ctrl-F-Mini models trained on the George Washington Dataset [17] and on the IAM Offline Handwriting Dataset [23]. We use the 100 most commonly occurring words in the test set as the set of queries \(Q\). The results, including a random baseline corresponding to randomly ranking all instances for each query, are presented in Table 7.
As might be expected, the models trained on the much larger IAM dataset are consistently slightly better than those trained on the George Washington dataset. However, none of the results significantly outperform the random baseline. We would like to highlight three particular challenges that could contribute to the poor results. First, none of the queries, which are Swedish names, are present in the training sets. Previous results have indicated that out-of-vocabulary queries tend to be more difficult for word spotting models [55]. Second, none of the provided models has been exposed to Swedish characters, nor are these characters accounted for in the embeddings. Third, the cursive and connected nature of text lines in our dataset and the variation in writing style hinder achieving robust results using existing pre-trained models [62]. Building generally applicable word spotting applications requires overcoming these challenges. Furthermore, since real-world applications (e.g., when dealing with big data) also perform page retrieval, the mAPpage represents an essential and complementary metric for evaluating word spotting methods.
6 Conclusion
This paper contributes to the research community by providing a large open-access dataset of historical handwritten documents. These documents result from a continuous effort to digitise church birth books available from parishes across Sweden. The dataset that we provide comprises 15,000 high-resolution images in RGB colour space and transcribed files containing a wealth of information spanning the period from 1800 to 1840. The main goal of sharing the SHIBR dataset with the research community is to spark more initiatives to further develop robust document analysis algorithms (e.g., word spotting, document retrieval, character recognition, binarization, layout analysis, etc.) and to promote cross-disciplinary research studies.
Availability of data and material
The SHIBR data set will be available publicly as open access at the following permanent link upon acceptance: https://ardisdataset.github.io/SHIBR/
Notes
1. IMPACT: http://www.impact-project.eu/home/, [Online], accessed on 2020-04-11.
2. https://eap.bl.uk/collection/EAP521-1, [Online], accessed on 2020-04-11.
3. The SHIBR data set is available as open access from the following link: https://ardisdataset.github.io/SHIBR/.
4. Älvsborg county was a county of Sweden until 1997 when it was merged with the counties of Göteborg, Bohus and Skaraborg to form Västra Götaland County. The county consisted of the provinces of Dalsland and the central part of Västergötland with the seat of residence in the city of Vänersborg.
5. Sveriges folkmängd från 1749 och fram till idag [Sweden’s population from 1749 until today] by the National Central Bureau of Statistics (SCB), 2017-10-27. Available from: https://www.scb.se/.
References
H Balk, A Conteh (2011) IMPACT: centre of competence in text digitisation. In: Proceedings of the 2011 workshop on historical document imaging and processing (pp. 155–160)
H Balk (2009) Poor access to digitised historical texts: the solutions of the IMPACT project. In: Proceedings of the third workshop on analytics for noisy unstructured text data (pp. 1–1)
M Krystyna, AH Qasem (2009) Digitizing the historical periodical collection at the Al-Aqsa Mosque Library in East Jerusalem. In: Proceedings IFLA world library and information Congress, Milan, Italy, August 24
Z Zakariah, N Janom, NH Arshad, SS Salleh, SRS Aris (2014) Crowdsourcing: the trend of prior studies. In: Proceedings of the 2014 4th international conference on artificial intelligence with applications in engineering and technology (ICAIET’14). IEEE computer society, USA, 129–133. DOI: https://doi.org/10.1109/ICAIET.2014.30
C Clausner, J Hayes, A Antonacopoulos (2019) Crowdsourcing historical tabular data: 1961 Census of England and Wales. In: Proceedings of the 5th international workshop on historical document imaging and processing (HIP’19). Association for Computing Machinery, New York, NY, USA, 42–47. DOI: https://doi.org/10.1145/3352631.3352643.
Kusetogullari H, Yavariabdi A, Cheddad A et al (2019) ARDIS: a Swedish historical handwritten digit dataset. Neural Comput Applic. https://doi.org/10.1007/s00521-019-04163-3
A Sanchez, PD Suarez, CAB Mello, ALI Oliveira, VMO Alves (2008) Text line segmentation in images of handwritten historical documents. In: Proceedings of the 2008 first workshops on image processing theory, tools and applications, Sousse, (pp. 1–6)
Zagoris K, Pratikakis I, Gatos B (2017) Unsupervised word spotting in historical handwritten document images using document-oriented local features. IEEE Trans Image Process 26(8):4032–4041. https://doi.org/10.1109/TIP.2017.2700721
C Djeddi, S Al-Maadeed, A Gattal, I Siddiqi, A Ennaji, HE Abed (2016) ICFHR2016 competition on multi-script writer demographics classification using “QUWI” database. In: Proceedings of the IEEE international conference on frontiers in handwriting recognition, (pp. 602–606)
Ahlawat S, Choudhary A (2020) Hybrid CNN-SVM classifier for handwritten digit recognition. Procedia Computer Science 167:2554–2560
R Alaasam, B Kurar, M Kassis, J El-Sana (2017) Experiment study on utilizing convolutional neural networks to recognize historical Arabic handwritten text. In: Proceedings of the 2017 1st international workshop on Arabic script analysis and recognition (ASAR), Nancy, (pp. 124–128)
Ribas FC, Oliveira LS, Britto AS, Sabourin R (2013) Handwritten digit segmentation: a comparative study. Int J Doc Anal Recognit 16:567–578
Ntirogiannis K, Gatos B, Pratikakis I (2014) A combined approach for the binarization of handwritten document images. Pattern Recogn Lett 35:3–15
DJ Kennard, AM Kent, WA Barrett (2011) Linking the past: discovering historical social networks from documents and linking to a genealogical database. In: Proceedings of the 2011 workshop on historical document imaging and processing (HIP 2011), New York, USA, (pp. 43–50)
DW Embley, S Machado, T Packer, J Park, A Zitzelberger, SW Liddle, N Tate, DW Lonsdale (2011) Enabling search for facts and implied facts in historical documents. In: Proceedings of the 2011 workshop on historical document imaging and processing (HIP 2011), New York, USA, (pp. 59–66)
S Athenikos (2009) WikiPhiloSofia and PanAnthropon: extraction and visualization of facts, relations, and networks for a digital humanities knowledge portal. In: Proceedings of the 20th ACM conference on hypertext and hypermedia (Hypertext 2009), Torino, Italy
The Washington Database, Retrieved on 2020-06-20, from: http://www.fki.inf.unibe.ch/databases/iam-historical-document-database/washington-database
G Washington, George Washington Papers, Series 2, Letterbooks 1754 to 1799: Letterbook 1- Dec. 25, 1755. [Manuscript/Mixed Material] Retrieved from the Library of Congress. https://www.loc.gov/item/mgw2.001/
Sarkar R, Das N, Basu S et al (2012) CMATERdb1: a database of unconstrained handwritten Bangla and Bangla-English mixed script document image. IJDAR 15:71–83
Handwritten Keyword Spotting Competition (H-KWS /ICFHR 2016), Retrieved on 2020-06-20, from: https://www.prhlt.upv.es/contests/icfhr2016-kws/data.html
ICFHR2016 Competitions, Retrieved on 2020-06-05, from: http://www.nlpr.ia.ac.cn/icfhr2016/competitions.htm
The IAM Handwriting Database, Retrieved on 2020-06-20, from: http://www.iam.unibe.ch/fki/databases/iam-handwriting-database
Marti U, Bunke H (2002) The IAM-database: an English sentence database for off-line handwriting recognition. Int J Doc Anal Recognit 5:39–46
M Kassis (2018) VML-HD: The historical Arabic documents dataset for recognition systems (VML-HD). 1, ID: VML-HD1, URL: http://tc11.cvc.uab.es/datasets/VML-HD_1
W Pantke, M Dennhardt, D Fecker, V Märgner, T Fingscheidt (2014) An Historical handwritten Arabic dataset for segmentation-free word spotting - HADARA80P. In: Proceedings of the 14th international conference on frontiers in handwriting recognition, Heraklion, (pp. 15–20). DOI: https://doi.org/10.1109/ICFHR.2014.11
B Kiessling, DS Ben Ezra, MT Miller (2019) BADAM: a public dataset for baseline detection in Arabic-script manuscripts. In: Proceedings of the 5th international workshop on historical document imaging and processing (HIP’19), ACM, 13–18. DOI: https://doi.org/10.1145/3352631.3352648
The ESPOSALLES Database, Retrieved on 2020-06-20, from: http://dag.cvc.uab.es/the-esposalles-database/
Romero V, Fornés A, Serrano N, Sánchez JA, Toselli AH, Frinken V, Vidal E, Lladós J (2013) The ESPOSALLES database: an ancient marriage license corpus for off-line handwriting recognition. Pattern Recogn 46:1658–1669
The IFN/ENIT-database, Retrieved on 2020-06-20, from: http://www.ifnenit.com/download.htm
Hussain R, Raza A, Siddiqi I et al (2015) A comprehensive survey of handwritten document benchmarks: structure, usage and evaluation. J Image Video Proc. 46(1):1–24
T Rath, R Manmatha (2003) Features for word spotting in historical manuscripts. In: Proceedings of the 7th international conference on document analysis and recognition (ICDAR), (pp. 218–222)
T Mondal, N Ragot, JY Ramel, U Pal (2015) Performance evaluation of DTW and its variants for word spotting in degraded documents. In: Proceedings of the 13th international conference on document analysis and recognition (ICDAR), (pp. 1141–1145)
Bhardwaj A, Setlur S, Govindaraju V (2009) Keyword spotting techniques for Sanskrit documents. In: Huet G, Kulkarni A, Scharf P (eds) Lecture Notes in Artificial Intelligence 5402. Springer, Berlin, pp 403–416
E Ataer, P Duygulu (2006) Retrieval of ottoman documents. In: Proceedings of the 8th ACM International workshop on multimedia information retrieval, (pp. 155–162)
I Rabaev, I Dinstein, J El-Sana, K Kedem (2014) Segmentation-free keyword retrieval in historical document images. In: A Campilho, M Kamel (eds) Image analysis and recognition ICIAR 2014. Lecture notes in computer science, Springer
Leydier Y, Lebourgeois F, Emptoz H (2007) Text search for medieval manuscript images. Pattern Recogn 40:3552–3567
V Mane, L Ragha (2009) Handwritten character recognition using elastic matching and PCA. In: Proceedings of the Int. Conf. Adv Comput, Commun Control, (pp. 410–415)
Y Lu, CL Tan (2002) Word searching in document images using word portion matching. In: Proceedings of the international workshop on document analysis systems (DAS 2002), Springer-Verlag, Berlin, Heidelberg, LNCS 2423, (pp. 319–328)
A Fischer, A Keller, V Frinken, H Bunke (2010) HMM-based word spotting in handwritten documents using subword models. In: Proceedings of the 20th international conference on pattern recognition (ICPR), IEEE, (pp. 3416–3419)
Bianne-Bernard AL, Menasri F, Mohamad RH, Mokbel C, Kermorvant C, Likforman-Sulem L (2011) Dynamic and contextual information in HMM modeling for handwritten word recognition. IEEE Trans Pattern Anal Mach Intell 33(10):2066–2080
A Ahmad, C Viard-Gaudin, M Khalid (2009) Lexicon-based word recognition using support vector machine and hidden Markov model. In: Proceedings of the 10th international conference on document analysis and recognition (ICDAR), (pp. 161–165)
Espana-Boquera S, Castro-Bleda M, Gorbe-Moya J, Zamora-Martinez F (2011) Improving offline handwritten text recognition with hybrid HMM/ANN models. IEEE Trans Pattern Anal Mach Intell 33(4):767–779
AC Rouhou, YK Kanoun (2019) Hybrid HMM/DNN system for Arabic handwriting keyword spotting. In: Proceedings of the 16th international conference on image analysis and recognition, Springer, Canada, (pp. 216–227), August 27–29. DOI: https://doi.org/10.1007/978-3-030-27202-9_19
MW Sagheer, N Nobile, CL He, CY Suen (2010) A novel handwritten Urdu word spotting based on connected components analysis. In: Proceedings of the 20th international conference on pattern recognition, Istanbul, (pp. 2013–2016). DOI: https://doi.org/10.1109/ICPR.2010.496
J Almazán, A Gordo, A Fornés, E Valveny (2013) Handwritten word spotting with corrected attributes. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia, (pp. 1017–1024). DOI: https://doi.org/10.1109/ICCV.2013.130
Frinken V, Fischer A, Manmatha R, Bunke H (2012) A novel word spotting method based on recurrent neural networks. IEEE Trans Pattern Anal Mach Intell 34(2):211–224
Krishnan P, Jawahar CV (2019) HWNet v2: an efficient word image representation for handwritten documents. IJDAR 22:387–405
S Sudholt, GA Fink (2016) PHOCNet: a deep convolutional neural network for word spotting in handwritten documents. In: Proceedings of the 15th international conference on frontiers in handwriting recognition (ICFHR), Shenzhen, (pp. 277–282). DOI: https://doi.org/10.1109/ICFHR.2016.0060
Ahmed R, Al-Khatib WG, Mahmoud S (2017) A Survey on handwritten documents word spotting. Int J Multimed Info Retr 6:31–47
Ali AAA, Suresha M (2020) Survey on segmentation and recognition of handwritten Arabic script. SN Comput Sci 1:192
Rath T, Manmatha R (2007) Word spotting for historical documents. IJDAR 9(2–4):139–152
Murugappan A, Ramachandran B, Dhavachelvan P (2011) A survey of keyword spotting techniques for printed document images. Artif Intell Rev 35(2):119–136
M Boualam, G Khaissidi, M Mrabti, Y Elfakir (2019) An overview on handwritten documents word spotting. In: Proceedings of the international conference on wireless technologies, embedded and intelligent systems (WITS), 3–4 April 2019
S Ren, K He, R Girshick, J Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: C Cortes, ND Lawrence, DD Lee, M Sugiyama, R Garnett (eds) Advances in neural information processing systems 28. Curran Associates, Inc., (pp. 91–99)
T Wilkinson, J Lindström, A Brun (2018) Neural word search in historical manuscript collections. arXiv preprint arXiv:1812.02771
Hatton SB (2019) History, kinship, identity, and technology: toward answering the question “what is (family) genealogy?” Genealogy 3(1):2. https://doi.org/10.3390/genealogy3010002
Abildgren K (2019) Mining archival genealogy databases to gain new insights into broader historical issues. Digit Libr Perspect 35(3/4):259–270. https://doi.org/10.1108/DLP-07-2019-0025
Z Zhu (2020) Content mining and visualization of traditional genealogies of China – Deployed on the genealogy of Wu’s in Gaoqian, Zhejiang. In: Proceedings of the iConference 2020: sustainable digital communities. March 23–27, Borås, Sweden
Wojciechowski M, Zakrzewicz M (2002) Dataset filtering techniques in constraint-based frequent pattern mining. In: Hand DJ, Adams NM, Bolton RJ (eds) Pattern detection and discovery. Lecture notes in computer science. Springer, Berlin
Statistiska Centralbyrån [National Central Bureau of Statistics]. (1969). Historical Statistics of Sweden: Part 1. Population 1720–1967, Stockholm (2nd edition). Available from http://share.scb.se/OV9993/Data/Historisk%20statistik/Historisk%20statistik%20f%C3%B6r%20Sverige%201700-1900-tal/Del1-Befolkning-1720-1967.pdf
Giotis AP, Sfikas G, Gatos B, Nikou C (2017) A survey of document image word spotting techniques. Pattern Recogn 68:310–332
A Cheddad (2016) Towards query by text example for pattern spotting in historical documents. In: Proceedings of the 7th international conference on computer science and information technology (CSIT), 13–14 July 2016, Amman, Jordan, (pp. 1–6). DOI: https://doi.org/10.1109/CSIT.2016.7549479
Acknowledgements
This project is funded by the research project “DocPRESERV: Preserving & Processing Historical Document Images with Artificial Intelligence”, STINT, the Swedish Foundation for International Cooperation in Research and Higher Education (Grant: AF2020-8892) and by the research project “Scalable Resource Efficient Systems for Big Data Analytics”, the Knowledge Foundation (Grant: 20140032) in Sweden. We also acknowledge the support of the Swedish company, Arkiv Digital AD AB, for providing the SHIBR dataset and for allowing us to make it open access. Finally, we acknowledge the editorial committee's support and the insightful comments and suggestions of the anonymous reviewers.
Funding
Open access funding provided by Blekinge Institute of Technology.
Ethics declarations
Conflict of interest
Johan Hall is an employee at the company Arkiv Digital AD AB (Sweden), the provider of this dataset. Agrin Hilmkil is an employee at the company Peltarion AB (Sweden). The rest of the authors declare that they have no conflict of interest.
Ethical Approval
The SHIBR dataset is a subset of random samples from the period 1800–1840, selected to comply with the GDPR (the European General Data Protection Regulation), and can therefore be made open access.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cheddad, A., Kusetogullari, H., Hilmkil, A. et al. SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset. Neural Comput & Applic 33, 15863–15875 (2021). https://doi.org/10.1007/s00521-021-06207-z
DOI: https://doi.org/10.1007/s00521-021-06207-z