Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents

  • Original Article
  • Neural Computing and Applications

Abstract

This paper introduces a new approach for automatically identifying the temporal origin of digitized historical documents stored as images, using documents from the Balkan region as an example. The approach rests on the idea that differences in orthographic style reflect the evolution of scripts or languages over time. It begins with a script coding phase, which maps the letters of the document into a sequence of numerical codes. Each code is associated with a gray level in the image space, so the sequence of numerical codes can be transformed into an image. Texture analysis is then applied to the obtained image to extract the document features. Finally, the feature vector of the document is classified to recognize its orthography style. An experiment is performed on two databases and on a test collection of historical documents extracted from digitized books in the Slavonic–Serbian and Serbian languages written in Cyrillic script, and in the Croatian recension of the Old Church Slavonic language written in angular Glagolitic script. The results show the efficacy of the proposed approach, its robustness to 'noisy' documents, and its superiority over other approaches in the literature that use language or script discrimination for orthography recognition.
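The coding and texture steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-class letter coding by shape (base, ascender, descender) over Latin letters, the gray-level spacing, and all function names are assumptions made for illustration only. The two features computed, short run emphasis (SRE) and long run emphasis (LRE), are standard gray-level run-length statistics, applied here to a one-dimensional sequence.

```python
from itertools import groupby

# Hypothetical letter classes (illustrative only, not the paper's coding table).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def code_letter(ch):
    """Map a character to an illustrative code: 0 base, 1 ascender, 2 descender."""
    if not ch.isalpha():
        return None  # spaces, digits and punctuation are skipped
    if ch.isupper() or ch in ASCENDERS:
        return 1
    if ch in DESCENDERS:
        return 2
    return 0


def encode(text):
    """Script coding: turn text into a sequence of numerical codes."""
    return [c for c in map(code_letter, text) if c is not None]


def to_gray(codes, levels=3):
    """Associate each code with a gray level in 0..255 (assumed even spacing)."""
    return [c * 255 // (levels - 1) for c in codes]


def run_lengths(seq):
    """Collapse a sequence into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(seq)]


def run_length_features(seq):
    """Short and long run emphasis over the runs of a 1-D gray-level sequence."""
    runs = run_lengths(seq)
    if not runs:
        return {"SRE": 0.0, "LRE": 0.0}
    n = len(runs)
    sre = sum(1.0 / (length ** 2) for _, length in runs) / n
    lre = sum(length ** 2 for _, length in runs) / n
    return {"SRE": sre, "LRE": lre}


if __name__ == "__main__":
    gray = to_gray(encode("a sample line of text"))
    print(run_length_features(gray))
```

In a full pipeline, a feature vector of this kind would be passed to a classifier to recognize the orthography style; the choice of classifier and the complete feature set are outside this sketch.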

Notes

  1. http://stari.nsk.hr/home.aspx?id=24.

  2. http://digitalna.nb.rs/.

Acknowledgements

This work was supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia [TR33037].

Author information

Corresponding author

Correspondence to Alessia Amelio.

Ethics declarations

Conflict of interest

Author Darko Brodić declares that he has no conflict of interest. Author Alessia Amelio declares that she has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Brodić, D., Amelio, A. Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents. Neural Comput & Applic 31, 3493–3513 (2019). https://doi.org/10.1007/s00521-017-3292-1

  • DOI: https://doi.org/10.1007/s00521-017-3292-1
