Abstract
This paper introduces a new approach for automatically identifying the temporal origin of digitized historical documents stored as images, using documents from the Balkan region as an example. The approach rests on the idea that differences in orthography style reflect the evolution of scripts or languages over time. It begins with a script coding phase, which maps the letters of the document into a sequence of numerical codes. Each code is associated with a gray level in the image space, so the sequence of numerical codes can be transformed into an image. Texture analysis is then applied to the resulting image to extract the document features. Finally, the feature vector of the document is classified to recognize its orthography style. Experiments are performed on two databases and on a test collection of historical documents extracted from digitized books in the Slavonic–Serbian and Serbian languages written in Cyrillic script, and in the Croatian recension of the Old Church Slavonic language written in angular Glagolitic script. The results show the efficacy of the proposed approach, its robustness to 'noisy' documents, and its superiority over other approaches in the literature that use language or script discrimination for orthography recognition.
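The pipeline described above (letters to codes, codes to gray levels, texture features from the resulting image) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, since the abstract does not give these details: the four-class coding table in `code_letter`, the Latin letter sets it uses, the gray levels in `GRAY_LEVELS`, and the choice of three classic run-length statistics (short run emphasis, long run emphasis, run percentage) as the texture features. It is not the authors' implementation, and the final classification step is left out.

```python
from itertools import groupby
from typing import Dict, List, Optional

# Hypothetical gray level assigned to each numerical code (an assumption).
GRAY_LEVELS = {0: 0, 1: 85, 2: 170, 3: 255}


def code_letter(ch: str) -> Optional[int]:
    """Map a character to an illustrative code, or None for non-letters.

    The four classes below are placeholders, not the paper's coding scheme.
    """
    if not ch.isalpha():
        return None
    if ch.isupper():
        return 3
    if ch in "bdfhklt":   # letters with an ascender (Latin, for illustration)
        return 1
    if ch in "gjpqy":     # letters with a descender (Latin, for illustration)
        return 2
    return 0


def text_to_gray_row(text: str) -> List[int]:
    """Turn text into a one-row gray-level image, skipping non-letters."""
    codes = (code_letter(ch) for ch in text)
    return [GRAY_LEVELS[c] for c in codes if c is not None]


def run_length_features(row: List[int]) -> Dict[str, float]:
    """Compute three run-length texture statistics on a one-row image."""
    # Each run is a maximal stretch of identical gray levels.
    run_lengths = [len(list(group)) for _, group in groupby(row)]
    n_runs = len(run_lengths)
    if n_runs == 0:
        return {"sre": 0.0, "lre": 0.0, "rp": 0.0}
    return {
        "sre": sum(1.0 / (n * n) for n in run_lengths) / n_runs,  # short run emphasis
        "lre": sum(float(n * n) for n in run_lengths) / n_runs,   # long run emphasis
        "rp": n_runs / len(row),                                   # run percentage
    }


if __name__ == "__main__":
    row = text_to_gray_row("add bee")
    print(row)
    print(run_length_features(row))
```

In a full system, the dictionary returned by `run_length_features` would play the role of the document's feature vector and be passed to a classifier trained on documents of known orthography style.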
Acknowledgements
This work was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia [TR33037].
Ethics declarations
Conflict of interest
Author Darko Brodić declares that he has no conflict of interest. Author Alessia Amelio declares that she has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Brodić, D., Amelio, A. Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents. Neural Comput & Applic 31, 3493–3513 (2019). https://doi.org/10.1007/s00521-017-3292-1
DOI: https://doi.org/10.1007/s00521-017-3292-1