1 Introduction

The characters of Tibetan historical documents cover both modern Tibetan and Sanskrit Tibetan, so there are more than 7,000 distinct characters. Many of these characters closely resemble one another, such as “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, etc., which makes character recognition technically difficult. In addition, many Tibetan historical documents were printed from woodblocks engraved by hand, so the cuts are usually uneven. As a result, the ink applied by hand during later printing is also uneven; deep grooves, for example, take less ink, so parts of the strokes are lost. Strokes may also be lost during image preprocessing of the ancient books, for example “ ” “ ” “ ” become “ ” “ ” “ ”, which further increases the difficulty of character recognition in Tibetan ancient books. At present, there is little research on image and character recognition for Tibetan ancient books.

The SVM method [1], the hidden Markov model [2] and similar methods are widely used in character recognition. The convolution neural network (CNN), proposed by LeCun, is a deep neural network with local connections between layers. Since the appearance of CNN, using various types of deep neural network models to analyze and recognize documents has become a research hotspot in this field. CNN has been successfully used in many areas, such as the recognition of handwritten digits, English characters and Chinese characters. Among the 107 papers collected at the ICFHR meeting held in late October 2016, the work on image analysis and retrieval [3], text line segmentation [4], feature extraction [5], classification and recognition [6] and other stages covered Chinese, English, Japanese, Mongolian, Arabic, Bengali and other scripts, and more than half of the papers applied deep learning. The Tibetan language includes modern Tibetan (also known as Tibetan language or local Tibetan language) and Sanskrit Tibetan (the Tibetan transliteration of Sanskrit). Printed modern Tibetan characters have been studied extensively, for example by Professor Ou Zhu at Tibet University, Professor Huang Heming at Qinghai Normal University and Professor Li Yongzhong at Jiangsu University of Science and Technology. The team of Professor Ding Xiaoqing at Tsinghua University developed a practical multi-font printed Tibetan character recognition system covering more than 592 characters [7, 8], which has been applied successfully. The literature [9,10,11,12,13] shows that statistical features of characters perform best for handwritten character recognition, and that the gradient feature gives a high recognition rate for off-line handwritten Chinese character recognition [14,15,16].
Researchers have successfully applied CNN to digit recognition [17, 18] and character recognition [19, 20] in natural scenes, and pointed out that CNN can learn features that are better than hand-designed ones [21, 22]. In [23], a deep CNN was applied to the recognition of similar off-line handwritten characters, and the recognition rate improved significantly over traditional methods. To our knowledge, however, there is no report on applying deep CNNs to character recognition in Tibetan ancient books. This paper therefore proposes using a deep CNN to recognize similar Tibetan characters.

Because Tibetan ancient books cannot be reproduced, samples of Tibetan characters can only be extracted from the document images of the ancient books themselves. The project team has already implemented preprocessing, binarization and layout analysis of these document images, and has completed character segmentation. Owing to the printing requirement of “soft character fine alignment and fine carving” in the Phyi dar of Tibetan Buddhism, most Buddhist texts adopted Uchen Script. The striking feature of Uchen Script is that the top stroke of each letter is horizontal and straight, so the characters are aligned along a straight baseline (see Fig. 1). The baseline (baseline 1, baseline 2, etc., shown as dotted lines in Fig. 1) is used to further segment each character into the vowel part above the baseline, for example “ ”, “ ” and so on above baseline 1, and the part under the baseline, such as “ ”, “ ”, “ ”, etc. There are only about a dozen types of characters above the baseline, and correspondingly few similar characters among them. This paper therefore mainly studies similar characters among the characters under the baseline.

Fig. 1. Document image of Tibetan ancient books (a part)

2 Construct Sample Set of Similar Characters

Because no character sample set of Tibetan ancient books currently exists, the following method is proposed to group and label sets of similar characters.

Features are first extracted from the Tibetan characters that have already been segmented. Three features are used in this paper:

  1. Gradient 8-direction features (64 D)

    First, the character image is normalized to 136 × 50; to keep image distortion low, bicubic interpolation is used for the resizing. A uniform 4 × 2 grid then divides the image into 8 equal cells, and the gradient of the character pixels in each cell is calculated. Following the method of Bai [24], the gradient is decomposed into 8 directions to form an 8-D gradient direction feature per cell, and the features of the 8 cells are concatenated into a 64-D gradient direction feature.

  2. Features of 8 × 8 grid (64 D)

    First, the character image is resized to 64 × 64, again using bicubic interpolation to keep image distortion low. A uniform 8 × 8 grid then divides the image into 64 equal cells, and the percentage of character pixels in each cell relative to the total is calculated, giving a 64-D feature.

  3. Peripheral features of characters (64 D)

    Using the same grid cells as in feature (2), peripheral pixel features are extracted from each cell in four scanning directions: top to bottom, bottom to top, left to right and right to left. The features of the four directions are combined into a single value per cell, so the 64 cells yield a 64-D feature.
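
As an illustration of the 8 × 8 grid feature in item (2), the following minimal Python sketch computes one value per cell for a binary character image. It is not the authors' code: the function name `grid_features` is hypothetical, the image is assumed to be already binarized and resized to 64 × 64, and "percentage in the total" is interpreted here as the share of all character pixels that fall in each cell, which is only one possible reading of the description.

```python
def grid_features(image, grid=8):
    """Return grid*grid values: the share of all character pixels in each cell.

    `image` is a square binary image (list of rows, 1 = character pixel).
    """
    size = len(image)
    cell = size // grid                      # 64 // 8 = 8 pixels per cell side
    total = sum(sum(row) for row in image)   # number of character pixels
    features = []
    for gy in range(grid):
        for gx in range(grid):
            count = sum(
                image[y][x]
                for y in range(gy * cell, (gy + 1) * cell)
                for x in range(gx * cell, (gx + 1) * cell)
            )
            features.append(count / total if total else 0.0)
    return features


# Toy 64 x 64 image: a vertical bar covering the 8 leftmost pixel columns.
image = [[1 if x < 8 else 0 for x in range(64)] for y in range(64)]
features = grid_features(image)
```

For this toy image, only the eight cells in the first grid column are non-zero, and the 64 values sum to 1.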

Concatenating the above three features gives a 192-D feature vector, which is reduced to 80 D by principal component analysis. k-means clustering is then applied, and for each character the file name and its distance to each centroid are recorded. Within each cluster the characters are sorted by distance, and the first k characters that fall in the same cluster and lie close together are treated as similar characters, forming a set of similar characters. MATLAB is used to copy the images of similar characters into the same folder, with the distance information prepended to each image's original file name. Sorting by file name then gathers images of the same character category together as far as possible (Fig. 2).
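
The renaming step can be sketched in Python as follows. This is a simplified illustration rather than the MATLAB procedure used in the paper: the centroids are assumed to come from a previously run k-means, the file-name pattern `cNN_distance_name` is a hypothetical choice, and the helper names are invented for this example.

```python
import math


def nearest_centroid(vector, centroids):
    """Return (cluster index, Euclidean distance) of the closest centroid."""
    distances = [math.dist(vector, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def prefixed_names(samples, centroids):
    """Prefix each file name with its cluster and distance, then sort.

    `samples` maps an image file name to its (reduced) feature vector.
    Zero-padded distances make alphabetical order equal numeric order.
    """
    names = []
    for filename, vector in samples.items():
        cluster, distance = nearest_centroid(vector, centroids)
        names.append(f"c{cluster:02d}_{distance:08.4f}_{filename}")
    return sorted(names)


# Toy 2-D features and two centroids (illustrative values only).
samples = {"a.png": [0.0, 0.1], "b.png": [5.0, 5.0], "c.png": [0.0, 1.0]}
centroids = [[0.0, 0.0], [5.0, 4.0]]
names = prefixed_names(samples, centroids)
```

After sorting, images assigned to the same cluster are adjacent and ordered by their distance to the centroid, which is what makes manual inspection and labeling of similar characters practical.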

Fig. 2. Construction process of similar character set of Tibetan Uchen script

3 Convolution Neural Network (CNN)

The convolution neural network (CNN) is a neural network designed for data with a grid-like structure, such as image data, which can be regarded as a two-dimensional pixel grid. CNN achieves a high recognition rate in 2-D image recognition, and its network structure is highly invariant to translation, scaling, tilting and other forms of deformation. CNN learns features and classifies characters directly from the original image, without much preprocessing or feature extraction. It is therefore an end-to-end recognition system, which avoids losing the details that distinguish similar characters when features are extracted and selected by hand in advance. This paper adopts the CNN structure shown in Fig. 3.

Fig. 3. CNN network structure

The convolution neural network is composed of convolution layers and sampling layers, and each layer consists of multiple feature maps. Each pixel (neuron) of a convolution layer is connected to a local area of the previous layer and can be viewed as a local feature detector. Each neuron can extract primary visual features such as oriented line segments and corner points. This local connectivity also gives the network fewer parameters, which benefits training. A convolution layer is usually followed by a sampling layer, which reduces the image resolution and gives the network a degree of invariance to displacement, scaling and distortion. In a convolution layer, the feature maps of the previous layer are convolved with multiple groups of convolution kernels, and the feature maps of the current layer are then obtained through the activation function. The convolution layer is computed as follows:

$$ a_{j}^{l} = \sigma \left( {\sum\nolimits_{{i \in M_{j} }} {a_{i}^{l - 1} } *w_{ij}^{l} + b_{j}^{l} } \right) $$
(1)

In Eq. (1), \( l \) is the index of the convolution layer; \( w \) is the convolution kernel, which is a 5 × 5 template; \( b \) is the bias; and \( \sigma \) is the activation function, that is, \( \sigma(x) = 1/(1 + e^{ - x} ) \). \( M_{j} \) represents the set of input feature maps of the previous layer that feed the \( j \)-th output feature map.
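
Eq. (1) can be illustrated with a short pure-Python sketch that computes one output feature map from a list of input maps. It is a didactic example under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the convolution is "valid" (no padding), and the kernel is flipped so that the operation is a true convolution rather than a correlation.

```python
import math


def sigmoid(x):
    """Activation function of Eq. (1): 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def conv_layer(inputs, kernels, bias):
    """Compute one output map a_j from the input maps in M_j.

    inputs  : list of square 2-D maps (lists of rows), one per i in M_j
    kernels : one square k x k kernel w_ij per input map
    bias    : scalar b_j
    """
    k = len(kernels[0])
    out_size = len(inputs[0]) - k + 1        # valid convolution, no padding
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = bias
            for a, w in zip(inputs, kernels):
                for u in range(k):
                    for v in range(k):
                        # Flipped kernel indices give a true convolution.
                        total += a[r + u][c + v] * w[k - 1 - u][k - 1 - v]
            row.append(sigmoid(total))
        output.append(row)
    return output


# A blank 28 x 28 input and one 5 x 5 kernel (illustrative values only).
inputs = [[[0.0] * 28 for _ in range(28)]]
kernels = [[[0.1] * 5 for _ in range(5)]]
feature_map = conv_layer(inputs, kernels, bias=0.0)
```

With a 28 × 28 input and a 5 × 5 kernel the output map is 24 × 24, and for the all-zero input every output value is σ(0) = 0.5.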

The sampling layer subsamples the features of the preceding convolution layer and produces the same number of feature maps. The convolution neural network is trained in the same way as a traditional neural network, using stochastic gradient descent. The input layer is a character image of Tibetan ancient books of size 28 × 28. C1 is the first convolution layer; it has eight feature maps of 24 × 24, and each pixel (node or neuron) in a feature map is connected to a corresponding 5 × 5 region of the input layer. S1 is a subsampling layer containing 8 feature maps of 12 × 12, and each node in a feature map is connected to a corresponding 2 × 2 region of the matching feature map in C1. C2 is the second convolution layer with 16 feature maps, each of size 8 × 8. The connection between S1 and C2 plays an important role in feature extraction. S2 is the second sampling layer with 16 feature maps, each of size 4 × 4. The last layer is the output layer with 10 nodes, one per output category, and it is fully connected to the S2 layer.
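
The feature-map sizes quoted above follow from simple arithmetic, which the sketch below reproduces. It assumes 5 × 5 kernels without padding in both convolution layers and non-overlapping 2 × 2 sampling windows; the kernel size of C2 is not stated explicitly in the text but is consistent with the reduction from 12 × 12 to 8 × 8.

```python
def conv_size(size, kernel=5):
    """Side length after a valid convolution with a kernel x kernel template."""
    return size - kernel + 1


def pool_size(size, window=2):
    """Side length after non-overlapping window x window subsampling."""
    return size // window


sizes = {"input": 28}
sizes["C1"] = conv_size(sizes["input"])   # 28 - 5 + 1
sizes["S1"] = pool_size(sizes["C1"])      # halved by 2 x 2 sampling
sizes["C2"] = conv_size(sizes["S1"])
sizes["S2"] = pool_size(sizes["C2"])

# Number of S2 values that feed the 10-node output layer (16 feature maps).
flattened = 16 * sizes["S2"] ** 2
```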

4 Experiment and Result Analysis

4.1 Experiment Data

The experimental data consist of two groups of similar characters under the baseline, each containing 10 Tibetan character categories. The first group, denoted G1, is a set of similar characters formed by Tibetan vertical stacks: “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ” and “ ”. It contains a total of 5,215 experimental samples.

The second group, denoted G2, is a set of similar characters composed of complete consonant characters: “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”. It contains a total of 24,700 experimental samples.

To evaluate the performance of CNN on similar Tibetan characters, it is compared with a Naive Bayes classifier (NBC) and a support vector machine (SVM) classifier. For NBC and SVM, the gradient 8-direction features described in Sect. 2 are first extracted to obtain a 64-D feature vector for each sample, and this vector is then used for classification. For CNN, the image of each Tibetan character is directly downscaled to a resolution of 28 × 28, which reduces the number of CNN parameters and thus speeds up training.

4.2 Experiment Process

When training the network shown in Fig. 3, error backpropagation and stochastic gradient descent are used to update the parameters w and b.

Let J(w, b) denote the error function. The gradient descent update of the parameters is:

$$ w := w - \alpha \frac{\partial J(w,b)}{\partial w} $$
(2)
$$ b := b - \alpha \frac{\partial J(w,b)}{\partial b} $$
(3)

α is the learning rate, which controls the step size of the descent. Its value was determined empirically in the experiment, and α = 1.5 was finally selected.
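
The update rules in Eqs. (2) and (3) can be illustrated on a toy error function. The sketch below is not the network's error function: J(w, b) = (w − 3)² + (b + 1)² is a hypothetical stand-in chosen because its gradient is easy to write down, and the step size α = 0.25 is chosen for this toy problem only (on this particular function, α = 1.5 would diverge, so the value reported in the paper should not be transferred to it).

```python
def grad_J(w, b):
    """Gradient of the toy error J(w, b) = (w - 3)^2 + (b + 1)^2."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def gradient_descent(w, b, alpha, steps):
    """Apply the updates of Eqs. (2) and (3) for a fixed number of steps."""
    for _ in range(steps):
        dw, db = grad_J(w, b)
        w = w - alpha * dw   # Eq. (2)
        b = b - alpha * db   # Eq. (3)
    return w, b


# Starting from (0, 0), the iterates approach the minimizer (3, -1).
w, b = gradient_descent(w=0.0, b=0.0, alpha=0.25, steps=50)
```

How large α can be before the iteration becomes unstable depends on the error function, which is why the paper selects it experimentally.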

To observe the influence of α on the recognition rate, the other parameters are fixed first. For example, the number of training epochs is set to 30, because a small number of epochs saves training time yet is enough to reflect the effect of α on the recognition rate. The values of α tested and the corresponding recognition error rates are shown in Table 1.

Table 1. Different α and corresponding recognition error rate

The values of α were tested in the order listed from top to bottom in Table 1. Table 1 shows that the error rate is smallest, 0.2339, when α = 1.5.

4.3 Experimental Results and Analysis

The experiment uses the CNN structure shown in Fig. 3, and uses the 64-D gradient feature for Naive Bayes and SVM classification. K-fold cross-validation (K = 10) is performed on the G1 and G2 sets: each set is evenly divided into 10 parts, T1, T2, T3, …, T10; each part is used once as the test set, while the other 9 parts form the training set. The error rates on the G1 and G2 sets are shown in Tables 2 and 3 respectively. The experimental results show that, compared with the Naive Bayes and SVM recognition methods, the method based on the deep neural network has a lower error rate. A likely reason for the poorer performance of SVM and Naive Bayes is that the information that discriminates similar Tibetan characters is lost during feature extraction.
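
The 10-fold protocol can be sketched as follows. This is an illustrative helper, not the authors' code: the function name `k_fold_indices` is hypothetical, and shuffling the samples before splitting (with a fixed seed) is an assumption, since the text only states that each set is divided evenly.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: random partition
    folds = [indices[i::k] for i in range(k)]     # T1 ... Tk, sizes differ by <= 1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# G1 has 5,215 samples, so the folds contain 521 or 522 samples each.
splits = list(k_fold_indices(5215))
```

Each sample appears in exactly one test fold, so averaging the ten error rates uses every sample once for testing.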

Table 2. A comparison of error rate of 10-fold cross-validation on G1 set
Table 3. A comparison of error rate of 10-fold cross-validation on G2 set

To illustrate the recognition performance of the proposed method, the average error rates of the different classifiers on the G1 and G2 sets are compared in Fig. 4.

Fig. 4. Error rate of different classifiers on G1 and G2 set

As Fig. 4 shows, the proposed method achieves good results even with relatively few training samples. It is also an end-to-end approach that needs no human intervention during training and recognition.

Figures 5 and 6 show the error curves of T10 of G1 and T10 of G2. The error of CNN in similar character recognition decreases as the number of iterations increases.

Fig. 5. T10 of G1 error curve

Fig. 6. T10 of G2 error curve

To further verify the robustness and stability of the network, 1/10 of the samples are randomly selected from each category of the G2 set to form the test set (Te), which contains 2,470 samples. In addition, five training sets (Tr1, Tr2, Tr3, Tr4 and Tr5) that contain no test sample of Te are randomly selected. Their sizes are 1, 2, 3, 5 and 9 times that of the test set, that is, 2,470, 4,940, 7,410, 12,350 and 22,230 samples respectively. The recognition error rates for these five training sets are shown in Table 4.

Table 4. A comparison of error rate of different training samples in G2 set

Table 4 shows that as the training set grows, the error rate of the recognition method based on the deep neural network gradually decreases, whereas the error rates of the NBC and SVM methods fluctuate. This indicates that the network is more stable across different sample sets and that the system is more robust.

5 Conclusion

This paper proposes using a convolution neural network to automatically learn and recognize the features of similar characters of Uchen Script in Tibetan ancient books. The sets of similar characters constructed in this paper are used to train the model parameters, and the experimental results show that, compared with the traditional methods: (1) the deep convolution neural network can automatically learn effective features from the pixel level and use them for recognition, which avoids the loss of detail caused by manual selection and extraction of features and improves the recognition rate; (2) as the number of training samples increases, the recognition error rate of the deep convolution neural network decreases markedly, so enlarging the training set clearly improves its recognition rate.