1 Introduction

Handwritten digit recognition refers to the process of transforming the ordered trajectory generated by writing on a handwriting device into the internal codes of digits. It is essentially a mapping from the coordinate sequence of a handwritten trajectory to the internal code of a digit, and it is one of the most natural and convenient means of human–computer interaction. With the popularity of mobile information tools such as smartphones and handheld computers, handwritten digit recognition technology has entered the era of large-scale application. It enables users to input text in a natural way, is easy to learn and use, and can replace a keyboard or mouse. There are many kinds of devices for handwriting input, such as electromagnetic induction handwriting boards, pressure-sensitive handwriting boards, touch screens, touch panels and ultrasonic pens. Handwritten digit recognition belongs to the category of digit recognition and pattern recognition. In terms of the recognition process, digit recognition can be divided into two categories: off-line recognition and on-line recognition. In terms of recognition objects, it can also be divided into two categories: handwritten digit recognition and printed digit recognition.

Handwritten digit recognition is also well known to be a challenging problem, and many algorithms have been proposed for it in recent years. Boukharouba (2017) developed a new feature extraction technique for handwritten digit recognition based on support vector machines (SVM). In this method, the vertical and horizontal directions of a digit image are combined with the well-known Freeman chain code, and the approach does not require any normalization of digits. Mohebi and Bagirov (2014) presented a convolutional recursive modified self-organizing map (SOM) and applied it to handwritten digit recognition. Their results showed that the proposed method can improve the recognition rate compared with other SOM-based algorithms.

In the machine learning context, it is commonly known that each standard learning algorithm usually shows different performance on different datasets. In other words, the use of an algorithm may lead to the production of strong classifiers on some datasets but the classifiers trained on other datasets using the same algorithm may be much weaker. In the case of handwritten digits recognition, a standard learning algorithm may be capable of learning some but not all specific characteristics of handwritten digits. Also, the same classifier may show different performance on different datasets, due to the different data distribution. In addition, instances of handwritten digits usually show very diverse characteristics due to different handwriting styles of different people, even if the instances belong to the same class (Ding et al. 2018).

To address the above issue, we propose in this paper to adopt instance-based recognition of handwritten digits in the setting of ensemble learning, with the aim of obtaining diverse classifiers trained using different learning algorithms. The whole recognition procedure involves using a convolutional neural network (CNN) for feature extraction, adopting a correlation-based feature subset selection method to obtain diverse feature sets, and applying multi-level fusion of classifiers trained on the different feature sets.

The main contributions of this paper are: (1) a CNN is used to extract more diverse features from each handwritten digit image, and different feature sets are prepared through filter-based feature selection; (2) an ensemble learning framework is proposed, which involves multi-level fusion of multiple classifiers trained on different feature sets using different learning algorithms.

The rest of this paper is organized as follows. In Sect. 2, we introduce some related work in this context. We describe the proposed approach in Sect. 3. Experimental results are presented in Sect. 4 and conclusions are given in Sect. 5.

2 Related work

This section provides a review of the applications of convolutional neural networks for image classification, an overview of handwritten digits recognition and a review of traditional machine learning methods alongside potential improvements through the use of granular computing concepts.

2.1 Convolutional neural network

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep feed-forward artificial neural networks that has been successfully applied to analyzing visual imagery (Plamondon and Srihari 2000). CNNs are considered an excellent tool for solving computer vision problems in a large number of fields. They are widely used in modern AI systems but also bring challenges. For example, at the IEEE Conference on Computer Vision and Pattern Recognition 2018, more than 10 papers were based on CNNs. Hui et al. (2018) proposed a state-of-the-art CNN named LiteFlowNet, which can be used to improve flow estimation accuracy. To train a deep convolutional neural network with both low-precision weights and low bit-width activations, Zhuang et al. (2018) proposed a two-stage optimization strategy to improve its performance. Feng et al. (2018b) proposed a new loss function named ‘Wing loss’ for robust facial landmark localisation with CNNs. For 3D shape recognition, Feng et al. (2018a) proposed the GVCNN (group-view convolutional neural network) framework, which achieves a significant performance gain. Zhang et al. (2018) proposed a new knowledge-based semi-supervised deep CNN for facial action unit intensity estimation, which can achieve comparable or even better performance than some common methods with smaller datasets. Besides these, CNNs are also used in energy-efficient reconfigurable accelerators (Chen et al. 2017), semantic image segmentation (Chen et al. 2018) and image fusion (Acharya et al. 2017a, b). CNNs involve relatively little pre-processing compared with other image classification algorithms. This independence from prior knowledge and human effort in feature design is a major advantage.

Convolutional architectures also appear to be well suited to extracting features from image data. In our approach, the image features of handwritten digits are extracted using a CNN architecture.

2.2 Review of machine learning methods

Many machine learning algorithms are used in image recognition. The most popular ones include multi-layer perceptron (MLP) (Mirjalili 2015), random forests (Biau et al. 2009), K nearest neighbour (KNN) (Vermeulen et al. 2017), Naive Bayes (NB) (Amor et al. 2004) and C4.5 decision tree (Quinlan 1996). These algorithms have also been widely used in handwritten digit recognition tasks.

A multilayer perceptron (MLP) is a type of feed-forward artificial neural network that consists of fully connected layers. An MLP has at least three layers (the input layer, a hidden layer and the output layer) (Yilmaz and Özer 2009). The number of hidden layers is not fixed, so it can be chosen according to the needs of the task. There is likewise no limit on the number of neurons in the output layer. An MLP is trained with supervised learning techniques (Ravi et al. 2017).

A random forest can be understood as a forest of CART trees, i.e. an ensemble learning model composed of multiple CART tree classifiers. Each tree is a member of the ensemble and is trained on a sample drawn randomly with replacement from the training set. In this way, multiple tree classifiers make up the trained model. A sample to be classified is then passed to every tree classifier, and its final class is decided by the majority voting rule (Wager and Athey 2017). Random forest can easily estimate the importance of each feature. However, if there is a strong relationship between two features A and B, that is to say, B can be deduced from A, then the importance scores of such features are not meaningful, because random forest often gives a high value only to A, while the value for B is much smaller (Scornet et al. 2015).
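The two ingredients described above, sampling with replacement and majority voting, can be sketched in a few lines of Python. This is only an illustration: the function names and the five hypothetical tree outputs are assumptions made here, and no actual CART trees are trained.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most frequent label among the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
training_set = list(range(10))  # stand-in for ten training instances
sample = bootstrap_sample(training_set, rng)  # same size, may contain repeats

# Hypothetical class labels predicted by five trees for one digit image.
votes = [3, 8, 3, 3, 5]
print(majority_vote(votes))  # 3
```

Because each tree sees a different bootstrap sample, the trees disagree on some inputs, and the vote smooths out their individual errors.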

The K-nearest neighbor (KNN) algorithm is a well-known statistical method for pattern recognition and occupies an important position among machine learning based classification algorithms. It is one of the simplest machine learning algorithms (Song et al. 2016), one of the most basic instance-based learning methods and one of the best text classification algorithms. The basic idea is that if the majority of the K instances closest to an unseen instance in the feature space (its nearest neighbors) belong to a category, the unseen instance is also assigned to that category. The selected neighbors are instances whose correct classes are already known (Zhang et al. 2017). A disadvantage of KNN is that it requires a large amount of computation, because the distance between each instance to be classified and all known instances must be calculated to obtain its K nearest neighbors.
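The idea can be illustrated with a minimal brute-force sketch. The function name, the Euclidean distance and the toy two-dimensional data are assumptions made for this example only.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by a majority vote among its k nearest training points."""
    # Distance from the query to every known instance (the costly step).
    scored = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    scored.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in scored[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well separated groups of points with labels 0 and 1.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.5, 0.5)))  # 0
print(knn_predict(train_x, train_y, (5.2, 5.4)))  # 1
```

The loop over all training points makes the cost of each prediction grow linearly with the size of the training set, which is the disadvantage noted above.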

Bayes' theorem is a very old statistical result (1763). Naive Bayes (NB) is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class (Chen and Jahanshahi 2018). The method first learns the joint probability distribution of inputs and outputs under this conditional independence assumption. Based on this model, for a given input, the output y with the maximum posterior probability is calculated by Bayes' theorem. Naive Bayes has the advantages of simple implementation and high prediction efficiency, and it can be used for large databases. The downside is that the prior probabilities have to be known (Amor et al. 2004).
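For reference, the resulting decision rule can be written as follows. The symbols are notation introduced here for illustration: x = (x_1, ..., x_d) is the feature vector of an instance and c ranges over the classes.

```latex
\hat{y} = \arg\max_{c} \; P(Y = c) \prod_{j=1}^{d} P(X_j = x_j \mid Y = c)
```

The product form follows directly from the independence assumption, and the factor P(Y = c) is the class prior that has to be known or estimated.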

C4.5 is a decision tree learning algorithm developed by Ross Quinlan (Polat and Güneş 2009), as an extension of his earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification problems in machine learning and data mining. It is a supervised learning method: given a dataset, each tuple is described by a set of attribute values and belongs to exactly one of a set of mutually exclusive categories. The goal of C4.5 is to learn a mapping from attribute values to categories that can be used to classify new entities of unknown category (Sathyadevan and Nair 2015).
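C4.5 is generally described as choosing the attribute to split on by the gain ratio, i.e. the information gain normalised by the split information. The sketch below computes this quantity for a single categorical attribute. It is not Quinlan's implementation, and the function names and toy data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting labels by the categorical attribute values."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0


labels = [0, 0, 1, 1]
print(gain_ratio(["a", "a", "b", "b"], labels))  # 1.0, a perfect split
print(gain_ratio(["a", "b", "a", "b"], labels))  # 0.0, an uninformative split
```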

3 Proposed framework

In this section, we provide a description of CNN based feature extraction and present an ensemble learning framework that involves multi-level fusion of multiple classifiers trained on different feature sets using different learning algorithms. We also justify how the design of the proposed framework involves the application of granular computing concepts.

3.1 CNN feature extraction

In our method, we use the LeNet-5 CNN to obtain more diverse features from each handwritten digit image.

The proposed CNN feature extraction for handwritten digit images by LeNet-5 is illustrated in Fig. 1.

Fig. 1 CNN feature extraction for the handwritten digit image

The LeNet architecture is considered as the first architecture for convolutional neural networks. We can easily see from the LeNet-5 in Fig. 1 that many feature maps are generated in each layer. So we can obtain more diverse features than using other common methods.

The LeNet-5 is an excellent architecture for handwritten digit recognition. It has two parts: one performs feature extraction and the other performs classification. In our approach, we do not use the LeNet-5 for classification (blue part in Fig. 1); we only use it to extract features from images. For classification, we use the proposed ensemble learning framework instead of a neural network that consists of fully connected layers.

Given an input image of 32 × 32 × 1, a convolution layer with six 5 × 5 filters and a stride of 1 is applied first. With a stride of 1 and no padding, the feature map is reduced from 32 × 32 to 28 × 28, so an output of 28 × 28 × 6 is generated. Average pooling with a filter width of 2 and a stride of 2 then halves each spatial dimension, giving 14 × 14 × 6. Next, another convolution layer with sixteen 5 × 5 filters produces an output of 10 × 10 × 16, and another pooling layer reduces this to 5 × 5 × 16. Therefore, we extract sixteen 5 × 5 feature maps from each image, and each feature map (5 × 5) is treated as a column vector (25 × 1). Overall, there are two convolution layers, two subsampling layers, and two fully connected layers in the LeNet-5.
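The sizes quoted above follow from the standard output-size formula for convolution and pooling layers. The short check below reproduces them; the helper names are assumptions made for this sketch, and it only does the arithmetic, without building a network.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window, stride):
    """Spatial size after a pooling layer along one dimension."""
    return (size - window) // stride + 1


c1 = conv_output_size(32, kernel=5)            # first convolution, 6 filters
s2 = pool_output_size(c1, window=2, stride=2)  # first average pooling
c3 = conv_output_size(s2, kernel=5)            # second convolution, 16 filters
s4 = pool_output_size(c3, window=2, stride=2)  # second pooling

print(c1, s2, c3, s4)  # 28 14 10 5
print(16 * s4 * s4)    # 400 values if all sixteen 5 x 5 maps are flattened
```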

3.2 Multi-level fusion of classifiers

The proposed ensemble learning framework involves multiple levels of fusion of diverse classifiers trained on different feature sets. The entire procedure of the proposed framework is illustrated in Fig. 2.

Fig. 2 Proposed framework of ensemble learning

In particular, as shown in the feature preparation layer in Fig. 2, different feature sets can be prepared through feature extraction using different methods (although in this paper we only obtain one feature set, extracted using CNN). The feature set extracted using a specific method can also be processed further to obtain different feature subsets using different feature selection methods. In the third layer, which concerns the training of multiple classifiers, m learning algorithms are used to train base classifiers on each feature set Fi. A primary ensemble Ei is therefore created on each of the n feature sets, as shown in the primary fusion layer. Finally, the n primary ensembles created on the n feature sets are fused further to create the final ensemble, and the output of the final ensemble is the final classification, as shown in the final fusion layer.
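The layered structure can be sketched as two nested fusion steps. In the sketch below, the trained base classifiers are replaced by hypothetical stand-in functions that return fixed class probabilities, and averaging is assumed as the fusion rule at both levels. All names are illustrative and not part of the framework itself.

```python
def mean_fusion(outputs):
    """Average several per-class probability lists element by element."""
    return [sum(column) / len(outputs) for column in zip(*outputs)]


def primary_ensemble(classifiers, features):
    """Fuse the m base classifiers trained on one feature set."""
    return mean_fusion([clf(features) for clf in classifiers])


def final_ensemble(classifiers_per_set, feature_sets):
    """Fuse the n primary ensembles, one per feature set."""
    primary_outputs = [
        primary_ensemble(classifiers, features)
        for classifiers, features in zip(classifiers_per_set, feature_sets)
    ]
    return mean_fusion(primary_outputs)


# Hypothetical stand-ins for trained classifiers on a two-class problem.
set1_classifiers = [lambda f: [0.75, 0.25], lambda f: [0.25, 0.75]]
set2_classifiers = [lambda f: [1.0, 0.0], lambda f: [0.5, 0.5]]

fused = final_ensemble(
    [set1_classifiers, set2_classifiers],
    [[0.1, 0.2, 0.3], [0.1, 0.3]],  # a full feature set and a selected subset
)
print(fused)  # [0.625, 0.375]
```

Here n = 2 feature sets and m = 2 learning algorithms per set, so four base classifiers contribute to the final output.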

In practice, the setting of ensemble learning can be achieved even in a more flexible way than the one shown in Fig. 2. For example, some base classifiers trained on a feature set can be combined to make up a primary ensemble, which is combined further with the other base classifiers to make up a secondary ensemble. In this context, a secondary ensemble can be created on each feature set and some or all of the secondary ensembles can be fused further to make up a higher level ensemble or even the final ensemble. We will show this kind of setting of ensemble learning in Sect. 4.

The proposed ensemble learning framework is essentially designed in the setting of granular computing, which is a formalized paradigm of information processing (Pedrycz 2011; Pedrycz and Chen 2011, 2015a, b). In general, granular computing can be considered as a method of structural thinking at the philosophical level but can also be used as a strategy of structural problem solving at the practical level (Yao 2005b).

In theory, the two main concepts of granular computing are granule and granularity (Liu and Cocea 2017, 2018; Liu et al. 2018). A granule is defined as a collection of smaller particles that can form a larger unit. In the context of ensemble learning, each ensemble can be viewed as a granule since it consists of multiple classifiers. Since granules can be of very different sizes, the concept of granularity is needed to place different granules at different levels according to their actual sizes. The proposed ensemble learning framework involves multiple levels of classifier fusion, where each level can be viewed as a specific level of granularity. In this context, a primary ensemble that only consists of base classifiers is viewed as a granule at the basic (bottom) level of granularity, whereas the final ensemble that may involve both base classifiers and lower level ensembles is viewed as a granule at the top level of granularity.

In practice, granular computing concepts are commonly used through taking one or both of the two operations, namely granulation and organization. The former operation is essentially decomposition of a whole into multiple parts in a top-down information processing manner, such as extraction of local features through the convolution layer of CNN, whereas the latter operation is essentially integration of multiple parts into a whole in a bottom-up information processing manner (Yao 2005a), such as fusion of multiple classifiers.

4 Experimental results and discussion

In this section, we report an experimental study conducted on the MNIST dataset, which is essentially a 10-class (0–9) classification task in the setting of machine learning.

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems (Niu and Suen 2012). The database consists of a training set of 60,000 images and a test set of 10,000 images.

In this experimental study, the whole procedure involves feature extraction, feature selection, and training and fusion of classifiers. During CNN feature extraction, we input each digit image to the LeNet-5 and output its feature maps in the 3rd layer (16 × 10 × 10). These feature maps are then flattened into a single column vector. In terms of the setting of the CNN architecture, the activation function is set as sigmoid, the loss function is set as mean squared error, and the optimization function is set as L2 regularizer. In addition, the input batch size is set as 1. There are 60,000 training images in MNIST and the number of epochs is 1, so there are 60,000 iterations in total.

In the feature selection stage, we apply the correlation-based feature subset selection method (Hall and Smith 1997) to obtain a reduced set of features. In this way, we use the reduced set of selected features alongside the original feature set extracted using CNN, such that diversity can be created through training classifiers on the two different feature sets.
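Correlation-based feature subset selection is usually expressed through a merit score that rewards features correlated with the class and penalises features correlated with each other. The sketch below evaluates that score for a candidate subset. The function and argument names are assumptions made here, and the average correlations are supplied by hand rather than estimated from the CNN features.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features given its two average correlations."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Four features, each moderately correlated with the class.
print(cfs_merit(4, 0.5, 0.0))  # 1.0 when the features are uncorrelated
print(cfs_merit(4, 0.5, 1.0))  # 0.5 when the features are fully redundant
```

A search procedure would then look for the subset with the highest merit, which tends to drop redundant features and so yields a reduced feature set.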

In the classifier training and fusion stage, we adopt KNN and RF for training base classifiers and primary ensembles, respectively, on the two feature sets. On each feature set, a secondary ensemble is obtained by combining the base classifier (trained using KNN) and the primary ensemble of decision trees (created using RF). The two secondary ensembles created on the two feature sets are combined further to make up a larger ensemble for final fusion. The whole setting of the ensemble creation on each feature set is illustrated in Fig. 3.

Fig. 3 Fusion of K nearest neighbor and random forest (Zhao and Liu 2018)

In terms of parameter settings, the K value for KNN is set to 3 and the trained random forest consists of 100 decision trees. The KNN and RF classifiers are fused by averaging their hidden outputs (the probability for each class), i.e. the mean rule of algebraic fusion (Zhou 2012). All the experiments are conducted using 10-fold cross-validation.
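For one feature set, the mean rule can be written as follows. The notation is introduced here for illustration: x is the feature vector of a digit image, and the two terms on the right are the class probabilities output by the KNN classifier and by the random forest.

```latex
\bar{P}(c \mid x) = \frac{1}{2}\left[ P_{\mathrm{KNN}}(c \mid x) + P_{\mathrm{RF}}(c \mid x) \right],
\qquad
\hat{y} = \arg\max_{c \in \{0, \dots, 9\}} \bar{P}(c \mid x)
```

The predicted digit is the class with the highest averaged probability.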

The results on the MNIST dataset are shown in Table 1 in terms of classification accuracy. The results indicate that the instance-based learning nature of the KNN method leads to an accuracy of ≥ 95.8% on the two feature sets. Also, the RF method is generally very capable of training highly diverse decision tree classifiers on different training samples and feature subsets, which leads to an accuracy of ≥ 95.7% on the two feature sets.

Table 1 Classification accuracy

On the above basis, the further fusion of the base classifier (trained using KNN) and the decision tree (primary) ensemble (created using RF) leads to an improvement of the classification performance on each feature set, which indicates that the different learning strategies between the KNN and RF methods can really result in diversity between their trained classifiers. The final fusion of the above two secondary ensembles created on the two feature sets leads to a further improvement of the classification performance.

In addition, although feature selection may not necessarily lead to advances in the classification performance for each single classifier trained on the reduced feature set in comparison with using the full feature set, the fusion of classifiers trained on the two feature sets can lead to an improvement, which would indicate that the preparation of different feature sets through feature selection can effectively lead to the creation of diversity among classifiers trained on the different feature sets.

Overall, the experimental results suggest that multi-level fusion of classifiers, with diversity created in various ways, can advance the classification performance in a layer-by-layer manner.

5 Conclusions

In this paper, we have proposed a framework that involves CNN based feature extraction and multi-level fusion of diverse classifiers. In particular, the framework is designed to increase the diversity among classifiers by preparing different feature sets and using different learning algorithms for classifier training. The experimental results show that the proposed ensemble approach can achieve a classification accuracy of ≥ 98% on the MNIST dataset. The results also indicate that an ensemble learning setting which aims to train diverse classifiers is very useful for advancing the overall classification performance.

In future, we will investigate how to achieve optimal feature subset selection to boost the performance further by using some optimization techniques (Chen and Chung 2006; Chen and Chien 2011; Chen and Kao 2013; Tsai et al. 2008, 2012). It is also worth exploring the effectiveness of the proposed framework in the setting of fuzzy ensemble learning (Nakai et al. 2003), where fuzzy set theory related techniques (Zadeh 1965; Wang and Chen 2008; Chen et al. 2009, 2012, 2013; Chen and Chen 2001, 2011; Chen and Tanuwijaya 2011; Chen and Chang 2011; Liu and Zhang 2018) are adopted to train base classifiers as the members of an ensemble (Liu and Chen 2018). It is also worth investigating the effectiveness of adopting the proposed ensemble learning framework in the context of multi-attribute decision-making (Xu and Wang 2016; Liu and You 2017; Chatterjee and Kar 2017; Lee and Chen 2008; Zulueta-Veliz and García-Cabrera 2018).