Abstract
This paper deals with the clustering of words from parish records, which is essential for downstream standardization. In the past, mostly in the 17th and early 18th centuries, names did not have a standard form, so further work requires clusters of words that have the same meaning. Moreover, parish records are written in various languages (Czech, German, and Latin), so if we want to express relations between people, occupations, or causes of death in one language, we need to standardize them.
The first step of standardization is pre-processing, the second is the comparison of words, and the last is classification into clusters. The most crucial step is the comparison of words. We have tested various approaches, such as Levenshtein distance and its modifications, q-grams, Jaro-Winkler, and phonetic codings such as Soundex and Double Metaphone. All of these methods were tested on the types of words that can appear in parish records: first and last names, occupations, villages, and relationships between people. Based on these tests, we have chosen the most suitable methods for clustering the different types of words.
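The three steps named above (pre-processing, word comparison, classification into clusters) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The diacritic-stripping normalisation, the length-normalised Levenshtein similarity, the greedy assignment to the first sufficiently similar cluster, the 0.7 threshold, and the function names `preprocess`, `similarity`, and `cluster` are all assumptions made for illustration.

```python
import unicodedata


def preprocess(word: str) -> str:
    """Lowercase and strip diacritics (an assumed, deliberately simple normalisation)."""
    decomposed = unicodedata.normalize("NFKD", word.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer word (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster(words, threshold=0.7):
    """Greedy clustering: compare each word with the first member of every cluster.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    clusters = []
    for word in words:
        key = preprocess(word)
        for members in clusters:
            if similarity(key, preprocess(members[0])) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    names = ["Joannes", "Johannes", "Johanes", "Martin", "Martinus", "Mathias"]
    for group in cluster(names):
        print(group)
```

In this sketch a word joins a cluster based only on its similarity to the cluster's first member, so the result depends on input order. Any of the other compared measures (q-grams, Jaro-Winkler, or a phonetic code such as Soundex) could replace `similarity` without changing the clustering loop.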
This work was supported by TACR No. TL01000130, by BUT project FIT-S-17-4014 and the IT4IXS: IT4Innovations Excellence in Science project (LQ1602).
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rozman, J., Hříbek, D., Zbořil, F. (2020). Testing of Various Approaches for Semiautomatic Parish Records Word Standardization. In: Wang, X., Lisi, F., Xiao, G., Botoeva, E. (eds) Semantic Technology. JIST 2019. Communications in Computer and Information Science, vol 1157. Springer, Singapore. https://doi.org/10.1007/978-981-15-3412-6_3
DOI: https://doi.org/10.1007/978-981-15-3412-6_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3411-9
Online ISBN: 978-981-15-3412-6
eBook Packages: Computer Science, Computer Science (R0)