1 Introduction

Handwritten mathematical symbol recognition is an essential component of handwritten mathematical expression recognition, which converts handwritten symbol images or pen traces into formats that can be displayed and edited on a computer, such as LaTeX. The task has both an offline and an online mode, and it remains challenging owing to the large number of classes, the wide variation in handwriting styles and the many symbols that look very similar. In online recognition, the input is a time-ordered sequence of sampled points captured by pen-based or touch-based devices such as smartphones and tablets; in the offline mode, the input is an image of the symbol after it has been written.

Online handwritten mathematical symbol recognition has been studied widely and has reached high accuracy. The Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) was held five times between 2011 and 2016 [1,2,3], attracting researchers from around the world, and its results represent the state of the art in online handwritten mathematical expression recognition. Isolated symbol recognition has been a separate task of the competition since CROHME 2014. Because online data retain the trace information, they can be converted to offline images. In recent years, researchers have used offline features extracted from the symbol images as an auxiliary source of information when recognizing online symbols, with good results. Álvaro et al. [2, 4] combined 9 offline features, including PRHLT and FKI features, extracted from the symbol images with 7 online features extracted from the original online data. They classified the features with Bidirectional Long Short-Term Memory recurrent neural networks (BLSTM) and achieved a 91.24% recognition rate in CROHME 2014. Dai et al. [5] used a Convolutional Neural Network (CNN) and a BLSTM to classify the symbol images and the online data separately, then combined the results to reach 91.28% on the CROHME 2014 test set. Davila et al. [6] also combined online features, such as normalized line length and covariance of point coordinates, with offline features, such as 2D fuzzy histograms of points and fuzzy histograms of line orientations. MyScript [3], the winner of CROHME 2016, also extracted both online and offline features and processed them with a combination of a deep MLP and recurrent neural networks. MyScript achieved 92.81% on the CROHME 2016 test set, which is the best result on that set as far as we know.

Nevertheless, offline handwritten mathematical symbol recognition has received little attention and few results have been published. Since datasets of offline handwritten mathematical symbols are rare, Ramadhan et al. [7] used the online data of CROHME to generate symbol images for offline recognition. They designed a CNN model, trained it on images converted from the CROHME 2014 training set, and obtained 87.72% accuracy on CROHME 2014 test images drawn from online data [7]. However, because the features of online data are unavailable and the network architecture is roughly designed, the accuracy of [7] is clearly lower than that of online recognition methods.

In recent years, the convolutional neural network, proposed by LeCun [8] for offline handwritten digit recognition, has been very successful in many computer vision tasks, such as image recognition [9,10,11,12], object detection [13, 14] and semantic segmentation [14]. In this paper, we apply CNNs to the recognition of offline handwritten mathematical symbols and design a deep and slim CNN architecture denoted HMS-VGGNet. Previous research has shown that deeper networks tend to achieve better results [9,10,11,12]. However, a deeper network is harder to train and its model size usually grows. To ease training and keep the model size reasonable, HMS-VGGNet, which is designed specifically for offline handwritten mathematical symbol recognition, uses Batch Normalization (BN) [15], Global Average Pooling (GAP) [16] and very small convolutional kernels. Because offline data are scarce, and to make our results comparable with others, we use both images drawn from the CROHME dataset and the data of HASYv2 [17] to train and evaluate our models. As shown in our experiments, HMS-VGGNet raises the accuracy of offline handwritten mathematical symbol recognition significantly.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to BN, GAP and the benefits of 1 × 1 and 3 × 3 convolutional kernels. Section 3 describes the datasets used in our experiments, and Sect. 4 presents our network configurations. Section 5 presents our training methods, experimental results and analyses. Section 6 concludes the paper.

2 A Brief Introduction of BN, GAP and Very Small Convolutional Kernels

2.1 Batch Normalization

Training a CNN becomes especially hard as the network gets deeper, because the inputs to each layer are affected by the parameters of all preceding layers [15]. Each layer has to adapt to changes in the distribution of its inputs, which makes training difficult and slow. Batch Normalization addresses this problem by keeping the input distribution of each layer stable, which accelerates training and improves performance.

To achieve this, BN processes the input data in two steps. First, in every training step BN normalizes the inputs of each layer to zero mean and unit variance. For one dimension \( x^{\left( k \right)} \) of the input x, BN normalizes the input by

$$ \hat{x}^{\left( k \right)} = \frac{{x^{\left( k \right)} - E\left[ {x^{\left( k \right)} } \right]}}{{\sqrt {Var\left[ {x^{\left( k \right)} } \right]} }} $$
(1)

where \( E\left[ {x^{\left( k \right)} } \right] \) and \( Var\left[ {x^{\left( k \right)} } \right] \) are the mean and variance of \( x^{\left( k \right)} \), in practice estimated over each mini-batch. However, this normalization step may destroy what the preceding layer has learned to represent. To recover these features, in the second step BN introduces two learnable parameters \( \gamma^{\left( k \right)} \) and \( \beta^{\left( k \right)} \), which scale and shift the normalized value. With the transformation

$$ y^{\left( k \right)} = \gamma^{\left( k \right)} \hat{x}^{\left( k \right)} + \beta^{\left( k \right)} $$
(2)

BN keeps the input distribution of each layer stable while retaining the representational power of the network.
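As an illustration, the two steps of Eqs. (1) and (2) can be sketched in plain Python for one mini-batch. This is only a minimal sketch and not the Caffe implementation used in the experiments: the function name `batch_norm` is hypothetical, and the small constant `eps`, which Eq. (1) omits, is an assumption added to avoid division by zero.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply Eqs. (1) and (2) to a mini-batch.

    batch: list of samples, each a list of K feature values.
    gamma, beta: per-dimension scale and shift, each of length K.
    eps: stability constant (an assumption; Eq. (1) does not include it).
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        column = [sample[k] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # mini-batch variance
        std = math.sqrt(var + eps)
        for i, v in enumerate(column):
            x_hat = (v - mean) / std                # Eq. (1)
            out[i][k] = gamma[k] * x_hat + beta[k]  # Eq. (2)
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, each output dimension has approximately zero mean and unit variance regardless of the scale of the input dimension.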

2.2 Global Average Pooling

Fully connected layers, placed after the convolutional or pooling layers, are common in classical CNN models such as LeNet-5 [8], AlexNet [9] and VGGNet [11]. However, fully connected layers are prone to overfitting because of their huge number of parameters. In 2013, Lin et al. [16] proposed global average pooling as a replacement for fully connected layers. A global average pooling layer averages all the values in each feature map to produce one output per map, as illustrated in Fig. 1.

Fig. 1. Process of global average pooling

GAP layers have three benefits: (1) GAP layers have no parameters, so no overfitting can occur at these layers; (2) since the output of GAP is the average over the whole feature map, GAP is more robust to spatial translations; (3) fully connected layers usually account for over 50% of the parameters of the whole network, so replacing them with GAP layers significantly reduces the model size, which makes GAP very popular in model compression [18].
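The operation and the parameter saving can be made concrete with a short sketch. The helper names and the layer sizes below (128 feature maps of 6 × 6 feeding 101 classes) are hypothetical and chosen only for illustration; they are not taken from Table 2.

```python
def global_average_pooling(feature_maps):
    """Reduce each H x W feature map to the mean of its values."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

def fc_parameters(channels, height, width, outputs):
    """Weights plus biases of a fully connected layer on the flattened maps."""
    return channels * height * width * outputs + outputs

maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pooling(maps)        # one value per feature map: [2.5, 2.0]

# Hypothetical sizes: a fully connected classifier on 128 maps of 6 x 6
# needs 465509 parameters, whereas the GAP layer itself needs none.
fc_cost = fc_parameters(128, 6, 6, 101)
```

In the GAP design of [16], the last convolutional layer produces one feature map per class, so the pooled vector can be passed directly to the softmax.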

2.3 1 × 1 and 3 × 3 Convolutional Kernels

In recent years, 1 × 1 and 3 × 3 filters have been widely used in new CNN models [10,11,12, 18, 19] because they reduce computation, prune parameters and improve accuracy.

A 1 × 1 convolutional layer keeps the spatial size of the feature maps while reducing their number, with little effect on accuracy, so such layers were used in [10, 19] to reduce parameters and avoid a computational blow-up. Meanwhile, 3 × 3 filters are the smallest filters that can capture the notion of left/right, up/down and center. Although the receptive field of a single 3 × 3 filter is small, a few consecutive 3 × 3 layers cover the same receptive field as a larger filter. For example, a stack of two 3 × 3 convolutional layers has an effective receptive field of 5 × 5, with the advantages of greater depth and fewer parameters [11, 19].
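The receptive-field and parameter arguments can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions and ignores biases; the channel width of 64 is a hypothetical value used only for the comparison.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, in_channels, out_channels):
    """Weight count of one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

channels = 64  # hypothetical width, for illustration only

# Two 3 x 3 layers see a 5 x 5 window but need fewer weights than one 5 x 5 layer.
field = stacked_receptive_field(3, 2)                # 5
two_3x3 = 2 * conv_weights(3, channels, channels)    # 73728
one_5x5 = conv_weights(5, channels, channels)        # 102400

# A 1 x 1 layer that halves the number of feature maps is cheap by comparison.
reduce_1x1 = conv_weights(1, channels, channels // 2)  # 2048
```

For equal input and output widths, the two-layer 3 × 3 stack uses 18/25 = 72% of the weights of the single 5 × 5 layer while adding one more non-linearity.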

3 Datasets

Because offline data are scarce, and to make our results comparable with others, we train and evaluate our models on images converted from the CROHME online data and on the images of the HASYv2 dataset. CROHME is the most commonly used dataset for handwritten mathematical symbol recognition, and HASYv2, which has 151 k handwritten mathematical symbol images, is to the best of our knowledge the largest public offline dataset of handwritten mathematical symbols.

3.1 CROHME Offline Data Generation

There are 101 different classes of mathematical symbols in the CROHME 2016 dataset. The online data are given in Ink Markup Language (InkML) [20]. In an InkML file, a symbol S consists of a set of traces \( \left\{ {T_{1} ,T_{2} , \ldots ,T_{n} } \right\} \). Each trace \( T_{i} \left( {i = 1, \ldots ,n} \right) \) consists of a sequence of time-ordered sampling points \( \left\{ {p_{i1} ,p_{i2} , \ldots ,p_{im} } \right\} \), and each point \( p_{ij} \left( {i = 1, \ldots ,n;j = 1, \ldots ,m} \right) \) records its position. When generating symbol images from online data, we connect consecutive points \( p_{ij} \) and \( p_{i(j + 1)} \) of the same trace with a line segment, and the image is complete once all the traces of the symbol are drawn. Because different data acquisition devices were used in CROHME, the size of the symbols varies a lot. Our generation approach, shown in Algorithm 1, therefore normalizes the symbol size.

The aspect ratio is an important feature of a symbol and varies a lot across mathematical symbols, so it is preserved: the longer side of each image obtained from Algorithm 1 is 70 pixels, while the shorter side depends on the symbol. We then pad image I with white pixels to a size of 70 × 70. Since ‘COMMA’, ‘.’ and ‘\prime’ are relatively small in real handwritten symbol images, their longer side is fixed to 16 pixels when drawing them, and the resulting images are likewise padded with white pixels to 70 × 70. Finally, we resize all images generated from online data to 48 × 48. The first row in Fig. 2 shows some samples of the generated images.
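The generation procedure described above can be sketched as follows. Algorithm 1 is not reproduced in the text, so this is an assumption-laden illustration rather than the authors' exact procedure: the function names are hypothetical, centering the symbol on the canvas and the simple interpolation used to draw lines are assumptions, ink is stored as 1 on a background of 0, and the final resize to 48 × 48 is left out.

```python
CANVAS = 70                                   # side of the padded square image
SMALL_SYMBOLS = {"COMMA", ".", "\\prime"}     # drawn with a shorter longer side
SMALL_SIDE = 16

def normalize_traces(traces, label):
    """Scale the traces so the longer bounding-box side matches the target
    length, then center them on the canvas (centering is an assumption)."""
    target = SMALL_SIDE if label in SMALL_SYMBOLS else CANVAS
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    longer = max(width, height) or 1.0        # guard for a single-point symbol
    scale = (target - 1) / longer             # aspect ratio is preserved
    off_x = (CANVAS - 1 - width * scale) / 2
    off_y = (CANVAS - 1 - height * scale) / 2
    return [[((x - min_x) * scale + off_x, (y - min_y) * scale + off_y)
             for x, y in trace] for trace in traces]

def draw_symbol(traces, label):
    """Rasterize a symbol by joining consecutive points of each trace."""
    image = [[0] * CANVAS for _ in range(CANVAS)]
    for trace in normalize_traces(traces, label):
        # A trace with one point is drawn as a single dot.
        segments = list(zip(trace, trace[1:])) or [(trace[0], trace[0])]
        for (x0, y0), (x1, y1) in segments:
            steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            for s in range(steps + 1):
                t = s / steps
                px = round(x0 + t * (x1 - x0))
                py = round(y0 + t * (y1 - y0))
                image[py][px] = 1
    return image

stroke = [[(0, 0), (100, 200)]]               # one trace with two points
regular = draw_symbol(stroke, "x")            # longer side spans all 70 pixels
small = draw_symbol(stroke, "COMMA")          # longer side spans 16 pixels
```

Scaling by a single factor for both axes is what preserves the aspect ratio; the shorter side is then filled with background pixels.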

figure a (Algorithm 1)

Fig. 2. Samples of the dataset used in our experiments. The first row shows the images drawn from online data; the second and third rows show samples generated by elastic distortion from the first-row image in the same column.

3.2 Data Enrichment

Because deep networks are highly expressive, overfitting is a common problem that is hard to deal with. Methods such as Dropout [21] and Batch Normalization have been proposed to prevent overfitting. However, the most effective way to prevent overfitting is to enrich the training set so that the networks learn more general features. The training set of CROHME 2016 contains only 85802 symbols, and HASYv2 has 151 k training samples in 369 classes. Besides being small, the training sets of CROHME and HASYv2 are also imbalanced: for example, the CROHME 2016 training set contains 8390 samples of the symbol ‘-’ but only 2 samples of ‘\( \exists \)’. These drawbacks lower the accuracy.

To mitigate these drawbacks, we use elastic distortion [22] to enrich our training sets. The algorithm starts from two random matrices \( \Delta {\text{x}}\left( {x,y} \right) = rand\left( { - 1,1} \right) \) and \( \Delta {\text{y}}\left( {x,y} \right) = rand\left( { - 1,1} \right) \), which represent the horizontal and vertical displacement of the pixel \( \left( {x,y} \right) \). Each matrix is convolved with a Gaussian kernel of size n × n and standard deviation \( \sigma \). Every pixel of the original image is then moved according to the convolution results \( \Delta {\text{conv}}\_{\text{x}} \) and \( \Delta {\text{conv}}\_{\text{y}} \). After the displacement, we rotate the image by a random angle \( \theta \). In this paper, \( \sigma \) = 5, n = 11 and \( \theta \) is drawn from the range \( - 25^\circ \) to \( 25^\circ \). Using elastic distortion, we have enriched each class to about 4000 samples in the CROHME training set and about 1000 samples in the HASYv2 training set. The second and third rows in Fig. 2 show several samples generated by elastic distortion.
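A minimal sketch of this augmentation in plain Python is given below, under several stated assumptions. The text does not specify a displacement scale, so the factor `alpha` is hypothetical (smoothing uniform noise with a normalized kernel yields sub-pixel values, hence some scaling is assumed). The sketch also uses zero padding in the convolution, inverse mapping with nearest-neighbour sampling instead of moving pixels forward, and 0 as the background value; all of these are implementation choices, not details from the paper.

```python
import math
import random

def gaussian_kernel(n, sigma):
    """Normalized n x n Gaussian kernel."""
    half = n // 2
    kernel = [[math.exp(-((i - half) ** 2 + (j - half) ** 2) / (2 * sigma ** 2))
               for j in range(n)] for i in range(n)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def smooth(field, kernel):
    """Convolve a 2D field with a symmetric kernel, using zero padding."""
    h, w, n = len(field), len(field[0]), len(kernel)
    half = n // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(n):
                yy = y + i - half
                if 0 <= yy < h:
                    for j in range(n):
                        xx = x + j - half
                        if 0 <= xx < w:
                            acc += kernel[i][j] * field[yy][xx]
            out[y][x] = acc
    return out

def elastic_distort(image, sigma=5.0, n=11, max_angle=25.0, alpha=30.0,
                    rng=random, background=0):
    """Elastic distortion followed by a random rotation (alpha is an assumption)."""
    h, w = len(image), len(image[0])
    kernel = gaussian_kernel(n, sigma)

    def noise():
        return [[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)]

    dx, dy = smooth(noise(), kernel), smooth(noise(), kernel)
    theta = math.radians(rng.uniform(-max_angle, max_angle))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = (w - 1) / 2, (h - 1) / 2
    out = [[background] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Undo the rotation to find the matching pixel of the distorted image.
            ix = round(cos_t * (x - cx) + sin_t * (y - cy) + cx)
            iy = round(-sin_t * (x - cx) + cos_t * (y - cy) + cy)
            if not (0 <= ix < w and 0 <= iy < h):
                continue
            # Follow the smoothed displacement back to the source pixel.
            sx = round(ix + alpha * dx[iy][ix])
            sy = round(iy + alpha * dy[iy][ix])
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out

# A 48 x 48 test image containing a filled square.
square = [[1 if 16 <= x < 32 and 16 <= y < 32 else 0 for x in range(48)]
          for y in range(48)]
augmented = elastic_distort(square, rng=random.Random(0))
```

Calling the function repeatedly with different random states produces as many distorted variants of one sample as are needed to reach the target class size.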

As HASYv2 covers most of the CROHME symbol classes, in the CROHME experiments we also use the HASYv2 samples whose class is included in CROHME. Since the images in HASYv2 are 32 × 32, those used in the CROHME experiments are resized to 48 × 48. In the CROHME experiments, we use the symbols of the CROHME 2013 test set as the validation set and the test sets of CROHME 2014 and 2016 to evaluate our models. In the HASYv2 experiments, we use cross validation as suggested in [17]. Table 1 shows the details of the datasets used in our experiments.

Table 1. Datasets used in our experiments

4 Network Configurations

To make the effects of BN, GAP and small kernels clear, we have designed four networks with similar architectures, detailed in Table 2. Network C is the baseline of our comparison experiments. Network A uses fully connected layers where C uses global average pooling. The only difference between B and C is that C uses Batch Normalization while B does not. Compared with C, D adds two extra 1 × 1 convolutional layers to reduce the dimension.

Table 2. HMS-VGGNet configurations (shown in columns). The detailed differences are shown in the contents of this section.

In Table 2, the convolutional layer parameters are denoted as “Conv-(filter size)-(number of filters)-(stride of filters)-(padding pixels)”. All max-pooling layers in our networks operate on a 2 × 2 pixel window with stride 2. All convolutional and fully connected layers are followed by the rectified linear unit (ReLU) non-linearity, and all convolutional layers apply Batch Normalization before ReLU, except those in Network B. ReLU and BN are omitted from Table 2 for brevity. The ratio of every Dropout operation used in our networks is 0.5.
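To show how this notation determines the size of the feature maps, the sketch below parses one layer string and applies the standard output-size formula. The entry "Conv-3-32-1-1" is a hypothetical example and is not taken from Table 2; the helper names are likewise illustrative.

```python
def parse_conv(spec):
    """Parse 'Conv-(filter size)-(number of filters)-(stride)-(padding)'."""
    name, kernel, filters, stride, padding = spec.split("-")
    if name != "Conv":
        raise ValueError(f"not a convolutional layer: {spec}")
    return {"kernel": int(kernel), "filters": int(filters),
            "stride": int(stride), "padding": int(padding)}

def conv_output_size(size, layer):
    """Spatial size after a convolution applied to a size x size input."""
    return (size + 2 * layer["padding"] - layer["kernel"]) // layer["stride"] + 1

def pool_output_size(size, window=2, stride=2):
    """Spatial size after max-pooling (exact for even input sizes)."""
    return (size - window) // stride + 1

layer = parse_conv("Conv-3-32-1-1")      # hypothetical entry, for illustration
after_conv = conv_output_size(48, layer) # 3 x 3, stride 1, padding 1 keeps 48
after_pool = pool_output_size(after_conv)  # 2 x 2 pooling with stride 2 gives 24
```

A 3 × 3 layer with stride 1 and one pixel of padding therefore leaves the 48 × 48 input size unchanged, and only the pooling layers reduce the resolution.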

The architecture of the networks, denoted HMS-VGGNet, is inspired by VGGNet [11]. However, our networks differ from the original VGGNet in several ways that suit the handwritten mathematical symbol recognition task. First, handwritten symbol images are much smaller and simpler than the natural images used with VGGNet, so we have pruned several layers and filters to fit our task. Second, Batch Normalization layers are added after all the convolutional layers of Networks A, C and D to accelerate training and improve accuracy. Third, we replace the fully connected layers with global average pooling layers in B, C and D, which reduces the model size by a large margin. Finally, in Network D we apply 1 × 1 filters, which reduce the model size further with a negligible effect on accuracy. These claims are examined in the experiments of Sect. 5.

5 Experiments

5.1 Experiments in CROHME Dataset

Training Methods.

Our experiments were conducted with the Caffe framework [23] on a GTX 1060 GPU. Training used stochastic gradient descent with a momentum of 0.9. The initial learning rate was 0.01 and was multiplied by 0.1 every 40k iterations. The batch size was 40, and training stopped after 202k iterations (around 20 epochs). We used the “xavier” algorithm to initialize the weights of all the convolutional layers in our networks.
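The step schedule and the epoch count can be verified with a short calculation. The function names are hypothetical, and the training-set size of 101 × 4000 samples is an approximation derived from the figure of about 4000 enriched samples per class given in Sect. 3.2, not an exact count.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.1, step_size=40_000):
    """Learning rate after multiplying by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)

def epochs_seen(iterations, batch_size, training_samples):
    """Number of passes over the training set."""
    return iterations * batch_size / training_samples

# Approximate size: about 4000 enriched samples for each of the 101 classes.
approx_training_samples = 101 * 4000
epochs = epochs_seen(202_000, 40, approx_training_samples)  # 20.0

# Learning rate at the start of each 40k-iteration stage.
schedule = [step_learning_rate(i) for i in range(0, 202_000, 40_000)]
```

Under this approximation, 202k iterations with a batch size of 40 correspond to 20 passes over the data, which matches the figure of around 20 epochs, and the learning rate falls from 0.01 to about 1e-7 by the final stage.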

Results and Analyses.

In the CROHME experiments, we use the symbols of the CROHME 2013 test set as the validation set and the test sets of CROHME 2014 and 2016 to evaluate the models. Table 3 shows the results of the four networks on these datasets.

Table 3. Accuracies of our models in CROHME datasets

All four networks perform well on the three datasets. The Top-1 recognition rates of Network C on the CROHME 2014 and CROHME 2016 test sets are about 0.5% to 1% higher than those of Networks A and B, and its Top-3 and Top-5 accuracies are about 0.1% to 0.5% higher. The gaps between C and D are rather small: C has a 0.39% and 0.15% higher Top-1 accuracy than D on CROHME 2014 and 2016, and the gaps in Top-3 and Top-5 recognition rates do not exceed 0.1%. These results indicate that using BN and GAP improves accuracy.
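For reference, Top-k accuracy counts a sample as correct when its true class is among the k highest-scoring classes. A minimal sketch follows; the function name and the toy scores are illustrative only and are unrelated to the values in Table 3.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k best-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

scores = [
    [0.6, 0.3, 0.1],   # best class 0, true class 0
    [0.2, 0.5, 0.3],   # best class 1, true class 1
    [0.5, 0.1, 0.4],   # best class 0, true class 2 (second best)
]
labels = [0, 1, 2]

top1 = top_k_accuracy(scores, labels, 1)   # 2 of 3 correct
top2 = top_k_accuracy(scores, labels, 2)   # 3 of 3 correct
```

The third sample shows why Top-3 and Top-5 rates exceed Top-1: a symbol confused with a visually similar class is usually still ranked near the top.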

Table 4 lists the parameter scales of the four models. Replacing the fully connected layers with global average pooling layers sharply decreases the model size: C has only 44.92% as many parameters as A. With the 1 × 1 convolutional layers, D is smaller still than C, with little effect on accuracy, as illustrated in Tables 3 and 4.

Table 4. Parameter scales of HMS-VGGNets. The model size is the size of the caffemodel file generated by Caffe. Since the only difference between B and C is the use of BN, the parameter scales of B and C are the same.

Combining separately trained classifiers is an effective way to raise accuracy and is also used in [4, 5, 9,10,11,12]. To increase recognition rates further, we have combined Network C with Network D by averaging the outputs of the two models. The results of our methods and of existing systems on CROHME are shown in Table 5. Our networks outperform all the other systems on the CROHME 2014 test set with a 91.82% Top-1 recognition rate. On the CROHME 2016 test set, our networks take second place, 0.39% below MyScript, the winner of CROHME 2016, which uses both online and offline features. Compared with existing methods that use only offline features on the CROHME dataset, our networks are significantly more accurate, as shown in Table 5.
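The averaging ensemble can be illustrated with a toy example. It is assumed here that the averaged outputs are the class probabilities of each network, which the text does not state explicitly; the function names and the probability values are hypothetical.

```python
def average_ensemble(prob_c, prob_d):
    """Average the class scores of two models element by element."""
    return [(c + d) / 2 for c, d in zip(prob_c, prob_d)]

def predict(probabilities):
    """Index of the highest-scoring class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

prob_c = [0.40, 0.45, 0.15]   # hypothetical output of Network C
prob_d = [0.50, 0.30, 0.20]   # hypothetical output of Network D

combined = average_ensemble(prob_c, prob_d)
choice_c = predict(prob_c)          # Network C alone picks class 1
choice_ensemble = predict(combined) # the ensemble picks class 0
```

When one model is only marginally in favour of a class and the other is clearly against it, the average follows the more confident model, which is how the ensemble can correct individual errors.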

Table 5. Top-1 accuracies of our networks compared with other systems on CROHME. Top 3 accuracies in each dataset are bolded.

Table 6 shows the average computational time of our four networks. Although Networks C and D take more time than Networks A and B, all four networks are quite fast in our CROHME and HASYv2 experiments.

Table 6. Computational time of our networks in our experiments

Although we have achieved rather good results on the CROHME dataset, our experiments raise two questions.

Question 1: Why are the results of A and B on CROHME 2013 (the validation set) only slightly lower, or even higher, than those of C and D?

Some symbol classes are difficult or even impossible to discriminate without context information because their handwritten forms are easily confused, such as ‘x-X-×’, ‘1-|’ and ‘0-o-O’. We have analyzed the test sets of CROHME 2013, 2014 and 2016 and found 24 symbol classes that are difficult to classify. These classes make up a higher percentage of the CROHME 2013 test set than of the CROHME 2014 and 2016 test sets, as shown in Table 7. This makes the CROHME 2013 test set harder to classify, so the gaps between the recognition rates of A, B, C and D are relatively small. It is also the reason why the Top-1 accuracy is significantly lower than the Top-3 and Top-5 accuracies. Some misclassified symbols are illustrated in Fig. 3.

Table 7. Percentage of symbols hard to classify. These symbol classes are ‘COMMA, (, 0, 1, 9, c, C, ., g, l, /, o, p, P, \prime, q, s, S, \times, v, V, |, x, X’

Fig. 3. Misclassified samples of our networks

Question 2: Why are the results of our methods still lower than those of MyScript?

Online data contain the trace information while offline data do not, so online data have an advantage when classifying symbols that have similar shapes but different writing processes, such as ‘5’ and ‘s’. Our networks use only offline features, so it is hard for them to classify those symbols.

5.2 Experiments in HASYv2 Dataset

In the HASYv2 experiments, we use cross validation to test our models, as suggested in [17]. HASYv2 has 10 folds, so we evaluate 10 times using different folds. The training method is almost the same as in Sect. 5.1: we trained for 185 k iterations in total (around 20 epochs) and divided the learning rate by 10 after every 35 k iterations.
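The evaluation protocol can be sketched as a generic 10-fold loop. This is an illustration only: the function names are hypothetical, the majority-class baseline merely stands in for training and testing a network, and the synthetic folds do not reflect the actual HASYv2 fold files.

```python
from collections import Counter

def cross_validate(folds, train_and_evaluate):
    """Hold out each fold once and average the resulting accuracies."""
    accuracies = []
    for held_out in range(len(folds)):
        test_set = folds[held_out]
        train_set = [sample for i, fold in enumerate(folds)
                     if i != held_out for sample in fold]
        accuracies.append(train_and_evaluate(train_set, test_set))
    return sum(accuracies) / len(accuracies)

def majority_baseline(train_set, test_set):
    """Placeholder for a network: always predicts the most common training class."""
    most_common = Counter(label for _, label in train_set).most_common(1)[0][0]
    return sum(label == most_common for _, label in test_set) / len(test_set)

# Ten synthetic folds of (sample id, label) pairs: three of class 0, one of class 1.
folds = [[(f"img{fold}_{i}", 0 if i < 3 else 1) for i in range(4)]
         for fold in range(10)]
mean_accuracy = cross_validate(folds, majority_baseline)
```

Reporting the mean over the 10 held-out folds gives a single accuracy figure that does not depend on one particular train/test split.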

Since the HASYv2 dataset was proposed only recently, other experiments on it are still rare. We have compared our results with the model baselines in [17], as shown in Table 8. All four networks proposed in this paper are more accurate than the baselines. Moreover, owing to the use of small convolutional filters, GAP and a carefully designed architecture, our networks have significantly fewer parameters than TF-CNN, the most accurate of the baselines. To the best of our knowledge, our models achieve state-of-the-art accuracy on the HASYv2 dataset.

Table 8. Accuracies of our networks compared with the model baseline of HASYv2

HASYv2 has 369 classes, and more of them are hard to discriminate than in CROHME, such as \( {\mathcal{H}} \), \( {\mathbb{H}} \) and H; →, ↦, ⇀ and ↪. Some symbols are difficult to tell apart even in printed form, such as \Sigma and \sum. These difficulties pose great challenges for our task, so our four networks perform similarly and accuracy does not improve further.

6 Conclusion

In this paper, we have designed a CNN architecture called HMS-VGGNet for offline handwritten mathematical symbol recognition. Experiments show that, with this slim and deep architecture, our models achieve very competitive results on the CROHME datasets (91.82% and 92.42% Top-1 accuracy on CROHME 2014 and 2016, and around 99% Top-3 and Top-5 accuracy on both) and on HASYv2 (85.13% Top-1, 97.38% Top-3 and 98.52% Top-5 accuracy). Our experimental results also demonstrate the benefits of BN, GAP and very small filters. In the future, we will use these networks in our offline handwritten mathematical expression recognition system. Since online data can be converted to offline images, our networks can also serve as an auxiliary method for online handwritten mathematical symbol recognition to improve accuracy further.