Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents

Brodić, Darko; Amelio, Alessia

doi:10.1007/s00521-017-3292-1

Original Article
Published: 27 November 2017

Volume 31, pages 3493–3513 (2019)
Neural Computing and Applications


Abstract

This paper introduces a new approach for automatically identifying the temporal origin of digitized historical documents stored as images, using documents from the Balkan region as an example. The approach rests on the idea that differences in orthographic style reflect the evolution of scripts or languages over time. It begins with a script coding phase, which maps the letters of the document into a sequence of numerical codes. Each code is associated with a gray level in the image space, so the sequence of codes can be transformed into an image. Texture analysis is then applied to this image to extract the document's features. Finally, the resulting feature vector is classified to recognize the document's orthographic style. Experiments are performed on two databases and on a test collection of historical documents extracted from digitized books: documents in the Slavonic–Serbian and Serbian languages written in Cyrillic script, and documents in the Croatian recension of the Old Church Slavonic language written in angular Glagolitic script. The results show the efficacy of the proposed approach, its robustness to 'noisy' documents, and its superiority over other approaches in the literature that use language or script discrimination for orthography recognition.
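
The pipeline described in the abstract (letters to codes, codes to gray levels, texture features from the resulting image) can be illustrated with a minimal sketch. The abstract does not give the coding table or the texture descriptor, so everything specific here is an assumption made for illustration: the four-class coding by letter height, the Latin letter sets `ASCENDERS`, `DESCENDERS` and `FULL`, and the choice of run-length statistics (short-run and long-run emphasis) as the texture features. The classification step is omitted; any standard classifier could take the resulting feature vector as input.

```python
from collections import Counter
from itertools import groupby

# Hypothetical coding table (illustration only): Latin letters grouped by
# height class. The paper works with Cyrillic and Glagolitic letters.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
FULL = set("j")


def code_text(text):
    """Map each letter to a numerical code in {0, 1, 2, 3}; skip non-letters."""
    codes = []
    for ch in text.lower():
        if not ch.isalpha():
            continue
        if ch in FULL:
            codes.append(3)
        elif ch in DESCENDERS:
            codes.append(2)
        elif ch in ASCENDERS:
            codes.append(1)
        else:
            codes.append(0)
    return codes


def to_gray_levels(codes, levels=4):
    """Spread the codes over the 0..255 gray range, giving a one-row image."""
    step = 255 // (levels - 1)
    return [c * step for c in codes]


def run_length_matrix(codes):
    """Count runs: key (code, run_length) -> number of runs of that kind."""
    return Counter((value, len(list(group))) for value, group in groupby(codes))


def run_length_features(codes):
    """Short-run emphasis (sre) and long-run emphasis (lre) of the sequence."""
    rlm = run_length_matrix(codes)
    n_runs = sum(rlm.values())
    if n_runs == 0:
        return {"sre": 0.0, "lre": 0.0}
    sre = sum(n / length ** 2 for (_, length), n in rlm.items()) / n_runs
    lre = sum(n * length ** 2 for (_, length), n in rlm.items()) / n_runs
    return {"sre": sre, "lre": lre}


if __name__ == "__main__":
    codes = code_text("The quick brown fox jumps over the lazy dog")
    print(to_gray_levels(codes))
    print(run_length_features(codes))
```

Run-length statistics are a natural fit for a one-row image because a run is simply a stretch of consecutive letters with the same code, so the features summarize how often the coded letter classes alternate in the text.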

Acknowledgements

This work was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia [TR33037].

Author information

Authors and Affiliations

Technical Faculty in Bor, University of Belgrade, Vojske Jugoslavije 12, Bor, 19210, Serbia
Darko Brodić
DIMES University of Calabria, Via P. Bucci Cube 44, 87036, Rende, CS, Italy
Alessia Amelio

Corresponding author

Correspondence to Alessia Amelio.

Ethics declarations

Conflict of interest

Author Darko Brodić declares that he has no conflict of interest. Author Alessia Amelio declares that she has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 72 KB)

Supplementary material 2 (pdf 624 KB)

Supplementary material 3 (pdf 32 KB)

Supplementary material 4 (pdf 47 KB)

About this article

Cite this article

Brodić, D., Amelio, A. Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents. Neural Comput & Applic 31, 3493–3513 (2019). https://doi.org/10.1007/s00521-017-3292-1

Received: 27 February 2017
Accepted: 16 November 2017
Published: 27 November 2017
Issue Date: August 2019
DOI: https://doi.org/10.1007/s00521-017-3292-1
