Ancient Geez script recognition using deep learning
Handwritten text recognition is one of the most valuable recognition tasks because of the unique characteristics of each person's handwriting; recognition systems must therefore adapt to the same characters written with widely varying shapes. One of the most challenging problems in handwritten text recognition is the recognition of ancient documents, which contain several kinds of noise. When these documents are digitized, the noise appears in different forms and degrades the performance of any recognition system. Digitizing ancient documents, applying proper pre-processing techniques, and employing an effective classifier are therefore the main steps of an efficient recognition system. In this paper, a complete recognition system for ancient Ethiopian Geez characters is proposed, using a deep convolutional neural network to recognize twenty-six base characters of this alphabet. The proposed system obtained an accuracy of 99.39% with a model loss of 0.044, which demonstrates its efficiency.
Keywords: Handwritten character recognition; Ancient document recognition; Convolutional neural networks
Ethiopia is the only African country with its own indigenous alphabet and writing system, the Geez or Amharic alphabet. Most other African countries use English or Arabic scripts.
Geez is a Semitic language used mostly in the Ethiopian and Eritrean Orthodox Tewahedo churches (EOTC). It is related to the South Arabic dialects and to Amharic, one of the most widely spoken languages of Ethiopia. More than 80 languages and up to 200 dialects are spoken in Ethiopia, several of which use Geez as their writing script. Among these, Geez, Amharic, and Tigrinya are the most widely spoken, and unlike the other Semitic languages they are written and read from left to right. Many ancient manuscripts written in Geez survive in Ethiopia, especially in the EOTCs. However, digital versions of these manuscripts are not available, owing to the lack of optical character recognition (OCR) systems that can convert them.
OCR is the process of extracting characters from an image and converting each extracted character into American Standard Code for Information Interchange (ASCII), Unicode, or another computer-editable format. Handwritten character recognition (HCR) converts large numbers of handwritten documents into machine-editable documents containing the extracted characters in their original order. Technically, the main steps of handwritten text recognition are image acquisition, pre-processing, segmentation, feature extraction, classification, and possibly post-processing.
Generally, there are two types of handwritten text recognition: offline and online. Online recognition is applied to data captured in real time, where information such as pen-tip location, pressure, and stroke order is available while writing; none of this is available in the offline case. Online recognition is therefore considered easier than offline recognition. In offline recognition, a scanned image or an image captured with a digital camera is the input to the recognition software. Once the images are captured, pre-processing is applied to them so that the characters can be recognized accurately. Offline recognition thus requires pre-processing of the images and is considered a more difficult task than online recognition.
Many different classification methods have been proposed in the literature [7, 8, 9]. Every approach has its own advantages and disadvantages, so it is difficult to select a single, generally efficient classification approach. Many researchers have proposed systems or approaches to improve the efficiency of the recognition process. HCR involves several stages, such as data acquisition, pre-processing, classification, and post-processing. Each stage has its own objectives, and its effectiveness determines the accuracy of the subsequent stages and, ultimately, of the overall recognition process.
Ancient script or document recognition is far harder than printed or modern handwritten character recognition because of aging, staining, ink quality, and similar degradations. One earlier study conducted ancient Devanagari document recognition using several character classification approaches: an artificial neural network (ANN), a fuzzy model, and a support vector machine (SVM). For the numerals, the authors achieved 89.68% accuracy with the neural network classifier and 95% with the fuzzy model; for the alphabets, they achieved 94% with the SVM and 89.58% with a multilayer perceptron (MLP).
In this paper, we propose an offline ancient Geez document recognition system using a deep convolutional neural network that integrates automatic feature extraction and classification layers. The proposed system includes a pre-processing stage for the digitized ancient images, a segmentation stage to extract each character, and feature extraction within the convolutional neural network architecture.
2 Image preparation
2.1 Image acquisition and pre-processing
Image acquisition is the first step of any image recognition system, including character and text recognition systems. A digital scanner is an effective tool for digitizing documents at high resolution. However, scanners cannot be used on ancient documents, since physical contact is forbidden because it can damage them. A digital camera is therefore used to capture and digitize the ancient documents.
During this process, several kinds of defects appear in the images, so the images must pass through pre-processing stages. The purpose of these stages is to remove irrelevant pixels, form a better representation of the images, and make them ready for the subsequent stages. Pre-processing involves sub-processes such as grayscale conversion, noise reduction, binarization, smoothing, and skew detection and correction.
2.1.1 Grayscale conversion and denoising
Captured RGB images are converted to grayscale in order to reduce the intensity dimensions of the image. This provides more efficient input data for the subsequent steps and decreases computational time; it is a common step in nearly all character and text recognition systems.
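The paper does not specify the exact conversion weights or denoising filter, so the following is a minimal sketch assuming the common ITU-R BT.601 luminance weights for grayscale conversion and a 3×3 median filter for salt-and-pepper noise:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to single-channel grayscale
    using the ITU-R BT.601 luminance weights (an assumed choice)."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb.astype(np.float64) @ weights).round().astype(np.uint8)

def median_denoise(gray, k=3):
    """Simple k x k median filter for salt-and-pepper noise.
    Border pixels are left unchanged for brevity."""
    out = gray.copy()
    r = k // 2
    for y in range(r, gray.shape[0] - r):
        for x in range(r, gray.shape[1] - r):
            out[y, x] = np.median(gray[y - r:y + r + 1, x - r:x + r + 1])
    return out
```

A median filter is a typical choice for scanned-document speckle because it removes isolated noise pixels without blurring stroke edges the way a mean filter would.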
2.1.2 Binarization and skew detection
After denoising, the document image must be binarized to prepare it for segmentation and feature extraction. This is one of the vital parts of document recognition and directly affects the recognition efficiency of the system.
Several binarization methods, both global [17, 18] and local [19, 20, 21], have been proposed to enhance or segment document images. Determining which method is superior is difficult, and several studies have addressed this question for document images. Local methods, however, have a notable disadvantage: their kernel size must be chosen carefully, since a small kernel adds noise while a large kernel behaves like a global method. Global methods are therefore preferred for document image binarization. The Otsu method is commonly cited as the most stable and accurate, and it was chosen for this step of the proposed system.
The Otsu method finds a threshold that separates the foreground of the image from the background so as to minimize the overlap between the white and black pixel classes. It uses discriminant analysis to divide the foreground from the background.
Then, the image negative is applied so that the character pixels become white and the background black.
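Otsu's threshold can be computed from the image histogram by maximizing the between-class variance, which is equivalent to the overlap-minimizing discriminant criterion described above. A compact sketch of this step combined with the image negative (the exact implementation in the system is not given, so this is illustrative):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that maximizes the between-class variance
    of the grayscale histogram (Otsu's discriminant criterion)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()                      # normalized histogram
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()      # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize_inverted(gray):
    """Binarize with Otsu's threshold, then take the negative so that
    dark ink becomes foreground (1) on a black background (0)."""
    t = otsu_threshold(gray)
    return (gray < t).astype(np.uint8)
```

Because dark ink falls below the threshold, the comparison `gray < t` performs the binarization and the negative in one step.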
Segmentation is a critical step that strongly affects the handwritten recognition process. Usually, segmentation involves three steps: line segmentation, word segmentation, and character segmentation. In the proposed system, contour analysis plays a major role in each of these. For line and character segmentation, a special technique of adding additional foreground pixels horizontally and vertically is used; filling in additional pixels in this way is known as dilation, and it is applied here for line segmentation. Word segmentation is not performed, because of the nature of Geez documents. The methods used in the proposed system are described as follows.
1. Scan the binary image while updating P, where P is the sequential number of the most recently found outermost border.
2. Represent each hole by shrinking its pixels into a single pixel, based on a given threshold t, where t is the minimum perimeter of a character (or of a line, in line segmentation).
3. Represent each outer border by shrinking its pixels into a single pixel, based on the same perimeter threshold t.
4. Place the shrunk outer-border pixel to the right of the shrunk hole pixel.
5. Extract the surrounding-relation connected components, i.e., the holes and outer borders.
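The steps above follow a border-following style of contour extraction. As a simplified illustration of the same dilate-then-extract idea (not the authors' exact algorithm), one can smear foreground pixels horizontally so characters on a line merge, then collect bounding boxes of the resulting connected components; the `reach` and `min_pixels` parameters are illustrative choices:

```python
import numpy as np
from collections import deque

def dilate_horizontal(binary, reach=15):
    """Smear foreground pixels sideways so that characters on the
    same text line merge into one connected blob."""
    out = binary.copy()
    for s in range(1, reach + 1):
        out[:, s:] |= binary[:, :-s]
        out[:, :-s] |= binary[:, s:]
    return out

def connected_boxes(binary, min_pixels=20):
    """Label 4-connected components by flood fill and return the
    bounding box (top, left, bottom, right) of each blob larger
    than min_pixels (a perimeter-like size threshold)."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not seen[y, x]:
                q = deque([(y, x)])
                seen[y, x] = True
                ys, xs = [], []
                while q:
                    cy, cx = q.popleft()
                    ys.append(cy); xs.append(cx)
                    for ny, nx in ((cy-1,cx),(cy+1,cx),(cy,cx-1),(cy,cx+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(ys) >= min_pixels:
                    boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes
```

Running `connected_boxes` on the dilated image yields one box per text line; running it on the original (undilated) binary image within each line box yields the individual characters.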
3 Feature extraction and classification
After the segmentation process, the system has all of the isolated characters and is ready to extract the features that uniquely represent them.
Convolutional neural networks are composed of two parts. The first part contains a number of convolutional layers which are a special kind of neural network that is responsible for extracting features from the input image, and the second part includes dense neural layers which are required to classify the features extracted from the convolution layers.
Table: Convolutional layer specifications of the system
Table: Dense layer specifications of the system (output layer: 28 units, the number of classes)
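The exact layer table is not reproduced here, but the spatial dimensions flowing through such a network follow the standard convolution and pooling size formulas. The sketch below assumes a hypothetical 32×32 input, two 3×3 "same"-padded convolution stages each followed by 2×2 max pooling, and 64 feature maps before flattening; only the 28-way output is taken from the paper:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max-pooling layer."""
    return (size - kernel) // stride + 1

# Hypothetical stack (input size, kernel sizes, and channel count are
# assumptions, not the paper's exact specification):
size = 32
size = pool_out(conv_out(size, kernel=3, pad=1))  # conv 3x3 'same' -> pool
size = pool_out(conv_out(size, kernel=3, pad=1))  # conv 3x3 'same' -> pool
flattened = size * size * 64   # assume 64 feature maps before flattening
num_classes = 28               # output units, as in the paper
print(size, flattened, num_classes)  # 8 4096 28
```

Checking the sizes this way before building the network avoids shape mismatches between the last pooling layer and the first dense layer.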
4 Experimental results
In this section, the dataset and the performed experiments are explained in detail.
Finally, the prepared dataset is divided into three parts: training, testing, and validation sets. 70% of the dataset (16,038 of the 22,913 characters) is used for training, 20% (4,583 characters) for testing, and 10% (2,292 characters) for validation.
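The paper does not state how the fractional counts were rounded; rounding the test and validation counts up and giving the remainder to training happens to reproduce the reported 16,038 / 4,583 / 2,292 split, so the sketch below assumes that convention:

```python
import math
import random

def split_dataset(samples, test_frac=0.2, val_frac=0.1, seed=42):
    """Shuffle and split samples into train/test/validation sets.
    Test and validation counts are rounded up (an assumed convention
    that matches the counts reported in the paper)."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = math.ceil(test_frac * n)
    n_val = math.ceil(val_frac * n)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, test, val

train, test, val = split_dataset(list(range(22913)))
print(len(train), len(test), len(val))  # 16038 4583 2292
```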
For line and character segmentation, the algorithm accepts binary images that have passed through all the pre-processing stages. It was tested on a number of document images, and its accuracy was as expected. In some cases, however, when lines were connected to one another, the algorithm was unable to segment them accurately; such connections between lines should have been removed in the noise removal and morphological transformation stages.
Table 3: Obtained results of the proposed system
Table 4: Comparison of test accuracy (%): highest and lowest results of the proposed system versus the highest and lowest results of Siranesh and Menore
It is exceedingly difficult to find related research on ancient Geez document recognition. One of the few studies on this subject is by Siranesh and Menore [14], so the obtained results are compared with the results of their research. Table 4 shows the comparison between the proposed system and their work.
Ancient document recognition faces a variety of challenges, such as noisy images, degraded quality, the type and characteristics of the ink, and the digitization process itself. An effective pre-processing phase and an adaptive classifier are therefore required to obtain results superior to those of previous work. This paper has focused on the development of an ancient Geez document recognition system using a deep convolutional neural network. The proposed system involves pre-processing, segmentation, feature extraction, and classification stages, and a dataset was prepared for its training and testing.
Most of the documents were difficult to read even with the naked eye; nevertheless, with the applied pre-processing steps, the proposed system obtained an accuracy of 99.39% with a loss of 0.044. This shows that the proposed system can be an effective approach to document recognition, and particularly to ancient document recognition.
A more comprehensive dataset may increase this accuracy, and future work will pursue this along with the implementation of other deep learning and classification models.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
- 1. Sahle E (2015) Top ten fascinating facts about Ethiopia media has not told you. Retrieved 2 Jan 2019 from https://africabusiness.com/2015/10/29/ethiopia-4/
- 2. Britannica, The Editors of Encyclopaedia (2015) Geez language. Retrieved 21 Mar 2019 from http://www.britannica.com/topic/Geez-language-accordion-article-history
- 3. Bender M (1971) The languages of Ethiopia: a new lexicostatistic classification and some problems of diffusion. Anthropol Linguist 13(5):165–288
- 5. Ahmad I, Fink G (2016) Class-based contextual modeling for handwritten Arabic text recognition. In: 2016 15th international conference on frontiers in handwriting recognition (ICFHR)
- 6. Shafii M (2014) Optical character recognition of printed Persian/Arabic documents. Retrieved 16 Dec 2018 from https://scholar.uwindsor.ca/etd/5179
- 7. Kavallieratou E, Sgarbas K, Fakotakis N, Kokkinakis G (2003) Handwritten word recognition based on structural characteristics and lexical support. In: 7th international conference on document analysis and recognition
- 8. Nasien D, Haron H, Yuhaniz SS (2010) Support vector machine (SVM) for English handwritten character recognition. In: 2010 2nd international conference on computer engineering and applications
- 11. Laskov L (2006) Classification and recognition of neume note notation in historical documents. In: International conference on computer systems and technologies
- 13. Bigun J (2008) Writer-independent offline recognition of handwritten Ethiopic characters. In: ICFHR 2008
- 14. Siranesh G, Menore T (2016) Ancient Ethiopic manuscript recognition using deep learning artificial neural network. Unpublished master's thesis, Addis Ababa University
- 15. Yousefi M, Soheili M, Breuel T, Kabir E, Stricker D (2015) Binarization-free OCR for historical documents using LSTM networks. In: 2015 13th international conference on document analysis and recognition (ICDAR)
- 16. Pal U, Wakabayashi T, Kimura F (2007) Handwritten Bangla compound character recognition using gradient feature. In: 10th international conference on information technology (ICIT 2007)
- 17. Kittler J, Illingworth J (1986) Minimum error thresholding. Pattern Recognit 19(4):41–47
- 18. Esquef IA, Mello ARG, de Albuquerque MP, de Albuquerque MP (2004) Image thresholding using Tsallis entropy. Pattern Recognit Lett 25:1059–1065
- 19. Sauvola J, Pietikainen M (2004) Adaptive document image binarization. Pattern Recognit 33:225–236
- 21. Chen Y, Leedham G (2005) Decompose algorithm for thresholding degraded historical document images. IEE Proc Vis Image Signal Process 152(6):702–714
- 22. Sekeroglu B, Khashman A (2017) Performance evaluation of binarization methods for document images. In: Proceedings of the international conference on advances in image processing, pp 96–102
- 23. Khashman A, Sekeroglu B (2007) A novel thresholding method for text separation and document enhancement. In: 11th Panhellenic conference in informatics, pp 323–330