Introduction

Machine-based recognition of handwritten alphabets is one of the requirements of language-based automation. The main associated challenges are the intrinsic, unconstrained diversity in the writing styles, shapes, scales, skews, orientations, and deformations of handwritten alphabets. Because large parts of their populations have not embraced English as their first language, nations like India, China, Egypt, Saudi Arabia, and the United Arab Emirates are building automation systems in their own national languages to benefit most of their populations. Many advancements have been reported for language-based automation systems related to English script due to its worldwide acceptance. Systems based on globally emerging languages like Hindi (Devanagari), Mandarin, Arabic, Javanese, Urdu, and Persian require extra care. The present work focuses on the Devanagari script. A set of Hindi numerals is shown in Fig. 1.

Fig. 1 Set of handwritten Hindi numerals

Several methods have been implemented so far for solving the proposed problem. Some benchmarking models are described as follows: Das et al. [1] extracted quad-tree longest-run and modular principal component analysis (PCA)-based features from numeral images and concatenated them. The classification was done with a one-versus-all support vector machine (SVM) classifier. Iamsa et al. [2] crafted histogram of oriented gradients (HOG) features from handwritten Hindi digits. The feedforward backpropagation neural network (FBNN) and extreme learning machine (ELM) were implemented as classification algorithms; the former was the top performer.

Khanduja et al. [3] created a hybrid of structural and statistical features that included intersection points, end points, loops, and pixel distributions. A feedforward neural network was employed for the recognition of numerals. Singh et al. [4] examined the performance of five distinct classifiers, namely multilayer perceptron (MLP), Naïve Bayes (NB), logistic classifier, random forest (RF), and SVM, over local weighted run-length features extracted from numeral images. The deep convolutional neural network (DCNN) model with a dropout function introduced by Acharya et al. [5] marked a turning point for Devanagari alphabet recognition algorithms. To advance relevant research, the authors also generated a benchmark dataset of isolated handwritten Devanagari characters and made it freely accessible to the public. The effect of adding more layers to convolutional neural networks (CNN) on the recognition of Devanagari alphabets was studied by Chakraborty et al. [6]. A hybrid CNN and bidirectional long short-term memory (BLSTM) model was also tried; however, it fell short of the performance of the standard CNN model. AlexNet, a pre-trained DCNN model, was used by Sonawane et al. [7] to present a transfer-learning method for identifying Devanagari characters. Aneja et al. [8] provided a thorough comparative analysis based on pre-trained DCNN models, including AlexNet, DenseNet-121, DenseNet-201, VGG-11, VGG-16, VGG-19, and Inception-V3, for the identification of Devanagari alphabets. Trivedi et al. [9] implemented a genetic algorithm and the L-BFGS optimization method to train a CNN, addressing the concerns of getting stuck in local optima and the large number of iterations. Their evolutionary technique achieved a higher recognition rate for handwritten Devanagari numerals. Kumar et al. [10] introduced a convolutional autoencoder based on unsupervised learning to extract reduced-size features from the augmented numeral images of Devanagari, English, and Bangla scripts. A deep convolutional network was employed for the final classification using these features. Chaurasia et al. [11] employed a CNN as a feature extractor to obtain salient features from handwritten numeral images of various Indian scripts. The authors employed an SVM classifier to gain the benefit of structural risk minimization. Sarkhel et al. [12] developed a state-of-the-art multicolumn, multi-scale CNN architecture for capturing important features from the images of handwritten characters related to several Indian scripts. An SVM classifier was employed for the classification task.

Some recent studies presented benchmark approaches to solving similar problems. Rakshit et al. [13] produced a comparative study of 11 different CNN models, namely, DenseNet-201, MobileNetV2, VGG-19, EfficientNetB0, NASNetMobile, Xception, InceptionResNetV2, ResNet50, EkushNet, InceptionV3, and ResNet152V2, in the recognition of handwritten Bangla characters. ResNet152V2 was the top performer. Garg et al. [14] examined k-NN and SVM classifiers with linear, polynomial, and radial basis function (RBF) kernels in machine-based recognition of Gurumukhi characters. Peak extent and modified division point-based features were crafted for the purpose. In their later study [15], the authors presented a multifeature, multi-classifier approach for solving the problem of recognizing Gurumukhi script from degraded images. The authors employed zoning, diagonal, shadow, and peak extent-based features on k-NN, decision tree, and RF classifiers. Kathigi et al. [16] developed a skewed line segmentation technique to separate the individual Kannada characters. Steerable pyramid and discrete wavelet transforms were implemented to extract salient features. The classification was performed with LSTM using combined features. Narang et al. [17] employed a CNN for feature extraction as well as for classification in the recognition of ancient characters in Devanagari script. The authors experimented with the CNN architecture by varying the counts of layers and filters, the size of the stride and kernel, and the activation functions in search of the best combination. To avoid manual feature engineering in the recognition of handwritten Urdu characters, Mushtaq et al. [18] developed a CNN model that outperformed the model based on handcrafted features. Robert Raj et al. [19] developed a recognition model for handling the problems of discontinuity, overlooping, and unnecessary portions present in the structure of Tamil characters. The authors introduced a junction point elimination algorithm that outperformed conventional feature selection and pre-extraction algorithms. Deore et al. [20] fine-tuned the popular deep convolutional neural network model VGG16 with advanced adaptive gradients to recognize handwritten Devanagari characters. Moudgil et al. [21] developed a convolution-based capsule network that captures spatial relationships among local features and reduces the vector length for effective classification of Devanagari characters. Guo et al. [22] proposed a solution for the recognition of similar-shaped Tai Le characters. The authors estimated the second- and third-level wavelet transforms for given character images and converted them into wavelet deep convolution features. Linear discriminant analysis and principal component analysis were applied to limit the feature dimensionality. The classification model included six deep, variationally sparse Gaussian processes for efficient recognition. It has been observed that deep learning techniques are replacing conventional feature extraction and classification techniques in this field in order to attain improved recognition accuracy [23].

It can be observed that the deep learning-based models achieved a significant recognition rate without the need for handcrafted features. The main concern is their large feature vectors, which may increase the classification cost. Optimizing the size of the feature vector can lead to a low-classification-cost solution [24] in the following terms.

  • Training time: A smaller feature vector typically implies fewer features that need to be processed and used to train a classifier. The computational complexity of training algorithms may scale with the number of features, leading to shorter training times for reduced feature sizes.

  • Memory usage: A smaller feature vector requires less memory to store the feature values during training and classification processes. This can lead to reduced memory usage, which can provide cost-effectiveness if there are limitations on the available memory resources.

  • Computational complexity: The computational complexity of the classification algorithms (SVM in the present case) can be influenced by the size of the feature vector. The computational complexity of SVM training and classification depends on the number of support vectors, which are the data points nearest to the decision boundary. The dimensionality of the problem decreases by reducing the number of features, and it becomes computationally less expensive to find the support vectors. Also, the number of kernel evaluations required during training and classification decreases, leading to faster execution.

Motivation

The state-of-the-art models can be categorized into two classes: (1) models that adopt a machine-learning approach and (2) models that employ deep convolutional neural networks.

Machine learning typically involves the use of statistical models that are trained on labeled data to make predictions or decisions. Machine learning models are often simpler and more interpretable. These models can often be trained on smaller datasets with fewer parameters. The model’s success significantly depends on the handcrafted features that are extracted from the data. Important concerns about handcrafted features are time consumption [25], the requirement of domain expertise and careful feature engineering [26], bias due to the designer’s prior assumptions that may not capture all relevant information in the data, and limited scalability, generalization, and reproducibility due to problem-specific design. This can limit the effectiveness of the model and lead to suboptimal performance.

The deep convolutional neural networks address these concerns through their potential to auto-generate features from raw images. These networks are well known for producing human-like performance in the field of pattern recognition. The main issues related to the implementation of these networks are the requirements of large datasets, millions of trainable parameters, and high computational complexity, which can restrict their deployment on low-end hardware platforms such as embedded systems, Raspberry Pi, field programmable gate arrays (FPGA), and cell phones.

The pros and cons of the abovementioned approaches induced the motivation for developing a recognition model to bridge the gap between them and receive optimum advantages.

Contribution

In the proposed work, the network architecture of benchmark DCNN models VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 was modified as a feature extractor to exploit their auto-generative feature capabilities. The classical PCA method was adopted for optimizing the size of feature vectors received from individual models. The optimized feature vectors were fused together in a strategic manner to obtain the maximum recognition rate from the benchmark SVM classifier. The suggested model provided a low-classification-cost solution to the proposed problem in terms of feature vector size.

Preliminary

An overview of the techniques used in the presented work is given in the following subsections.

VGG-16Net

VGG-16Net is a convolutional neural network architecture developed by the Visual Geometry Group (VGG) at the University of Oxford. It is named for its 16 weight layers, namely 13 convolutional layers and 3 fully connected layers [27]. The VGG-16Net architecture was designed for image recognition and classification tasks and achieved state-of-the-art performance on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2014. The network consists of stacks of 3 × 3 convolutional layers, each followed by a rectified linear unit (ReLU) activation function, with a 2 × 2 max pooling layer after each stack. The final layers of the network consist of fully connected layers that perform the classification task. VGG-16Net is a deep neural network that has about 138 million parameters and requires significant computational resources to train. However, pre-trained versions of the network are available and can be used for transfer learning, which allows for faster training on new image recognition tasks.

VGG-19Net

VGG-19Net is a convolutional neural network architecture with 19 weight layers [27]. It was developed by the Visual Geometry Group (VGG) at the University of Oxford and achieved state-of-the-art performance in ILSVRC-2014. The architecture of VGG-19Net is similar to that of VGG-16Net but with the addition of three extra 3 × 3 convolutional layers. VGG-19Net has 143 million parameters and requires significant computational resources to train. However, pre-trained versions of the network are available and can be used for transfer learning, which allows for faster training on new image recognition tasks.

ResNet-50

ResNet-50 is a deep convolutional neural network architecture that was introduced by Microsoft Research in 2015. The name “ResNet” comes from “residual network,” which refers to the use of residual connections, or skip connections, which allow information to bypass certain layers in the network. This helps mitigate the vanishing gradient problem, which can occur when training very deep neural networks [28]. ResNet-50 consists of 50 layers and is used primarily for image recognition tasks such as object detection and classification. The architecture of ResNet-50 is based on building blocks known as residual blocks. In ResNet-50, each residual block is a bottleneck block of three convolutional layers (1 × 1, 3 × 3, and 1 × 1) with a skip connection. The skip connection allows the input to be added directly to the output of the residual block, which helps preserve information and gradients through the network.

Inception-v3

Inception-v3 is a convolutional neural network architecture that was introduced by researchers at Google in 2015 [29]. The architecture of Inception-v3 is based on the use of “inception modules,” which consist of several parallel convolutional layers with different filter sizes. This allows the network to capture features at multiple scales and helps reduce the computational cost of the network. Inception-v3 also uses a technique called “factorization,” which decomposes large convolutions into smaller convolutions. This helps reduce the number of parameters in the network and improve its computational efficiency. Inception-v3 also includes other features such as batch normalization and dropout regularization, which enhance the generalization performance of the network.

Principal component analysis

PCA is a statistical method that reduces the dimensionality of a dataset by projecting the original data onto a lower-dimensional subspace defined by the principal components. This projection preserves as much of the original variability as possible while reducing the number of dimensions needed to represent the data [30]. PCA has several applications, including data compression, feature extraction, and the visualization of high-dimensional data. It is also commonly used as a preprocessing step for other machine learning algorithms to reduce the number of features and improve the accuracy of the model. The steps involved in the estimation of the principal components are described as follows:

Let X be a data matrix of dimension N × F, where N is the number of samples and F is the number of features.

  1. Standardization of X:

    $$Z= \left(X-\mu \right)/\sigma$$
    (1)

    where Z is the standardized data matrix, μ is the mean vector of X, and σ is the standard deviation vector of X. This transforms each feature of X to have zero mean and unit variance, which ensures that all features are on the same scale and have equal importance in the analysis.

  2. Calculation of the covariance matrix of the standardized data:

    $$S= \left(1/N\right) \times {Z}^{T}\times Z$$
    (2)

    where S is the covariance matrix and \({Z}^{T}\) is the transpose of Z.

  3. Determination of the eigenvectors and eigenvalues of the covariance matrix:

    $$S \times V=\lambda \times V$$
    (3)

    where V and λ represent the eigenvectors and eigenvalues, respectively, and can be denoted as follows:

    $$V=\left[{V}_{1}, {V}_{2}, {V}_{3}, \dots, {V}_{F}\right]$$
    $$\lambda =\left[{\lambda }_{1}, {\lambda }_{2}, {\lambda }_{3}, \dots, {\lambda }_{F}\right]$$

    The eigenvectors represent the principal components, and the eigenvalues represent the variance explained by each principal component.

  4. Calculation of principal components:

    $$PC=Z \times V$$
    (4)

    where PC represents the principal components.
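The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the function name `pca_scores`, the synthetic data, the guard for constant features, and the sorting of eigenvectors by decreasing eigenvalue before truncation are all choices made for this example.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project X (N samples x F features) onto its leading principal components."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                 # assumption: leave constant features unscaled
    Z = (X - mu) / sigma                    # Eq. (1): standardization
    S = (Z.T @ Z) / Z.shape[0]              # Eq. (2): covariance of standardized data
    eigvals, eigvecs = np.linalg.eigh(S)    # Eq. (3): S is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]       # eigh returns ascending order; sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :n_components]           # keep the leading eigenvectors
    return Z @ V, eigvals                   # Eq. (4): principal component scores

# Synthetic data (hypothetical): 200 samples, 6 features, one nearly collinear pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)

pc, eigvals = pca_scores(X, n_components=2)
print(pc.shape)
print(np.round(eigvals / eigvals.sum(), 3))  # fraction of variance per component
```

Because the data are standardized with the 1/N convention of Eq. (2), the eigenvalues sum to the number of features F, and the variance of each score column equals the corresponding eigenvalue.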

Support vector machine

It is a popular and powerful machine learning algorithm used for classification and regression analysis. The basic idea behind an SVM is to find the hyperplane that best separates the data points of different classes. The hyperplane is chosen so that it maximizes the margin, which is the distance between the hyperplane and the closest data points in each class. The data points closest to the hyperplane are called support vectors. SVMs can handle both linearly separable and nonlinearly separable data by using different types of kernels. A kernel function transforms the original data into a higher-dimensional feature space, where it may become linearly separable. Some commonly used kernel functions include the linear, the polynomial, and the RBF kernels. In addition to binary classification, SVMs can be extended to handle multiclass classification problems by using techniques such as one-vs-all and one-vs-one [31]. SVMs have several advantages over other classification algorithms, including their ability to handle high-dimensional data, their robustness to overfitting, and their effectiveness even with small datasets. In the proposed work, an SVM classifier was employed with the one-versus-all technique and an RBF kernel. The classification cost of a one-versus-all SVM classifier can be calculated as follows:

Let “m” be the number of classes, “n” be the number of training samples, and “d” be the dimensionality of the feature vector. During training, the one-versus-all SVM classifier trains m separate binary SVM classifiers, one for each class. Each binary SVM classifier is trained on the full training set, with the samples of one class labeled as positive and the samples of all other classes labeled as negative. Let C be the regularization parameter of the SVM, and let “k” be the kernel function used by the SVM. The training complexity of the one-versus-all SVM classifier can be expressed as follows:

$$O\,\left(m\,\times\,{n}^{2}\,\times\,d\right)\,\times\,[\mathrm{complexity\,of\,the\,kernel\,function\,k}]$$
(5)

During classification, the one-versus-all SVM classifier applies each of the m binary SVM classifiers to the test sample and selects the class with the highest score. Let “t” be the number of test samples. The classification complexity of the one-versus-all SVM classifier can be expressed as follows:

$$O\,\left(m\,\times\,t\,\times\,d\right)\,\times\,\left[\mathrm{complexity\,of\,the\,kernel\,function\,k}\right]$$
(6)

The complexity of the RBF kernel (k) used in an SVM classifier depends on the number of training samples and the dimensionality of the feature vector. The RBF kernel function is defined as follows:

$$k \left(x, {x}^{\prime}\right) =\mathrm{exp} \left(-\gamma \times {\Vert x- {x}^{\prime}\Vert }^{2}\right)$$
(7)

where x and x′ are two feature vectors, ||x − x′|| is the Euclidean distance between them, and γ is a parameter that determines the width of the kernel. The complexity of the RBF kernel function can be calculated as follows:

For a single evaluation of the kernel function, the time complexity is O(d), since we need to compute the Euclidean distance between the two feature vectors. To evaluate the kernel function for all pairs of training samples, the complexity is as follows:

$$O \left({n}^{2} \times d\right)$$
(8)

This is because there are \({n}^{2}\) pairs of training samples and the kernel function must be computed for each pair. From Eqs. (5), (6), and (8), it is evident that the complexities of the SVM classifier depend directly on the dimensionality (d) of the feature vector. This suggests that reducing the dimensionality (size) of the feature vectors would lower the classification cost. The same holds for many other classifiers.
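As a rough worked example of this dependence on d, the sketch below evaluates the RBF kernel for one pair of vectors and compares the pairwise kernel cost of Eq. (8) for two feature sizes. The training-set size of 15,000 follows from the 75:25 split of 20,000 samples described later; the unreduced baseline of 12,288 features (4096 + 4096 + 2048 + 2048) is a hypothetical comparison point introduced only for this illustration, and the function names are likewise assumptions.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel of Eq. (7); one evaluation touches every one of the d features."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def pairwise_kernel_cost(n, d):
    """Operation count proportional to n^2 * d, as in Eq. (8)."""
    return n * n * d

n_train = 15000                       # 75% of 20,000 samples
d_full = 4096 + 4096 + 2048 + 2048    # hypothetical unreduced concatenation
d_reduced = 80                        # size of the fused, PCA-reduced vector

ratio = pairwise_kernel_cost(n_train, d_full) / pairwise_kernel_cost(n_train, d_reduced)
print(d_full, ratio)  # 12288 153.6
```

Under this simple cost model, the count of elementary operations for the kernel evaluations drops by a factor of about 154; actual run times would also depend on the solver and the number of support vectors.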

Methods

The complete overview of the proposed scheme is depicted in Fig. 2.

Fig. 2 Design of proposed model

Input dataset

The input dataset was compiled from a public repository [5]. Each handwritten numeral in Hindi script is accurately labelled. The dataset exhibits a wide range of variations in writing style, size, slant, stroke thickness, etc., that are commonly encountered in real-world scenarios. It has a balanced distribution of numerals across the classes, which helps avoid class bias during training. The dataset contains 20,000 samples. All these reasons make it a suitable choice for the proposed work.

Dataset preprocessing

The pretrained DCNN models have specific input size requirements. In the presented work, the images in the input dataset were resized to match the input size expected by each model. The required input image sizes for the DCNN models are provided in Table 1.

Table 1 Input size requirements of DCNN models

The resized images for VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 were represented by S1, S2, S3, and S4, respectively, in Fig. 2.

Feature extraction

The architecture of individual models was modified for the purpose of feature extraction. The classification block of the individual DCNN models typically consists of fully connected layers with a large number of parameters (of the order of millions). These layers are responsible for mapping the extracted features to 1000 class labels, as the classification blocks of individual models were originally designed to solve the classification problem of the ImageNet dataset with 1000 object classes. Since the objective of the proposed strategy is to exploit the auto-generative feature capability of pretrained DCNN models, their classification blocks are of no use. In the modified architecture, the classification blocks were removed completely to eliminate the computational burden and memory requirements associated with the fully connected layers. The remaining convolutional layers in the modified architecture were locked out of further training in order to take advantage of transfer learning. Arrangements have been made to collect the features after the final convolutional layer of each model. The individual models were set as feature extractors. The simplified architectures of modified networks are shown in Fig. 3. The resized images (S1, S2, S3, S4) were applied to the respective modified DCNN architectures, VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3. The sizes of corresponding feature vectors derived for a given digit image were 4096, 4096, 2048, and 2048, respectively, and were represented by F1, F2, F3, and F4 in the design (refer to Fig. 2). The process of feature extraction is depicted in Algorithm 1.

Fig. 3 Modified architecture of DCNN models as feature extractor. a VGG-16Net. b VGG-19Net. c ResNet-50. d Inception-v3

Algorithm 1. Algorithm for feature extraction

Feature optimization

This stage included the feature reduction and feature fusion steps of the proposed methodology. The background details of the numeral images were almost identical and did not carry any pattern-related information (refer to Fig. 1). This suggested the possibility of redundant information in the individual feature types (F1 to F4). Principal component analysis (refer to the “Principal component analysis” section) was applied to the individual feature types F1 to F4 to eliminate feature collinearity. A trial-and-error technique was used to identify the optimum number of principal components. First, 10 PCA components were estimated for each of the separate feature vectors (i.e., F1, F2, F3, and F4). These components were concatenated to form one vector. A sample dataset of 500 such fused feature vectors (50 samples from each numeral class) was made for this purpose. The sample dataset was used to train and test the proposed classifier. The procedure was then repeated on the sample dataset while stepping up the principal component count from 10 to 40 in increments of 2. The recognition accuracy increased greatly as the component count rose from 10 to 20, but no further significant increase was seen beyond that. This suggests that 20 is the optimal number of components for each feature type. The feature vectors (F1, F2, F3, and F4) were therefore each reduced in dimension to 20, and the resulting reduced feature vectors are shown as R1, R2, R3, and R4, respectively (refer to Fig. 2). The reduced features R1 to R4 were concatenated into a single feature vector Z. The frame format of the fused feature vector Z is shown in Fig. 4. The size of the proposed optimized feature vector is thus 80. The vector Z was estimated for all the numeral images in the input dataset. The reduced feature vectors R1 to R4 and the fused feature vector Z were used to create five new datasets. The process of feature optimization is depicted in Algorithm 2.

Fig. 4 Frame format of fused feature vector Z

Algorithm 2. Algorithm for feature optimization
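The reduction and fusion steps can be illustrated with the following NumPy sketch. It is not a reproduction of Algorithm 2: the random matrices stand in for the real DCNN features, the helper name `reduce_features` is hypothetical, and the principal directions are obtained through a singular value decomposition of the standardized data, which yields the same directions as the eigenvectors of the covariance matrix without forming a large d × d matrix.

```python
import numpy as np

def reduce_features(F, k):
    """Standardize F (N x d) and keep the scores of the k leading principal components."""
    sigma = F.std(axis=0)
    sigma[sigma == 0] = 1.0                  # assumption: leave constant features unscaled
    Zs = (F - F.mean(axis=0)) / sigma        # standardized features
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Zs, full_matrices=False)
    return Zs @ Vt[:k].T

# Random stand-ins for F1..F4 (sizes taken from the text; values are synthetic).
rng = np.random.default_rng(0)
n_samples = 100
sizes = {"F1": 4096, "F2": 4096, "F3": 2048, "F4": 2048}
features = {name: rng.normal(size=(n_samples, d)) for name, d in sizes.items()}

# R1..R4: each feature type reduced to 20 components, then fused into Z.
reduced = [reduce_features(F, k=20) for F in features.values()]
Z = np.hstack(reduced)
print(Z.shape)  # (100, 80)
```

In a train/test setting, the mean, standard deviation, and principal directions would normally be estimated on the training portion only and then applied to the test portion; the sketch omits this for brevity.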

Numeral recognition

Besides the input dataset, five new datasets have been created up to this stage. The details are given in Table 2. Datasets D1 to D4 were created from the features received from VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3, respectively, after feature optimization. Dataset D5 was created by concatenating the features related to datasets D1 to D4. The individual datasets were split into train and test sets in a ratio of 75:25. The SVM classifier was trained and tested with individual datasets. Details of the hyperparameters used during the classifier learning are given in Table 3. The results were recorded in terms of precision, recall, F1 score, and recognition accuracy. The formulations used for the calculation of the metrics are given in Table 4. Here, TP, TN, FP, and FN represent true-positive, true-negative, false-positive, and false-negative events during the testing phase of the proposed classifier. A comprehensive result analysis is provided in the next section.

Table 2 Summary of newly created datasets
Table 3 Major parameters of SVM classifier
Table 4 Details of performance metrics used in the proposed study
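The training and testing procedure described above can be sketched with scikit-learn as follows. This is an illustrative setup only: the data are synthetic stand-ins for dataset D5 (10 classes, 80 features), and the hyperparameter values `C=1.0` and `gamma="scale"` are library defaults assumed here, not the values of Table 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for dataset D5: 10 numeral classes, 80 features per sample.
X, y = make_classification(n_samples=1000, n_features=80, n_informative=20,
                           n_classes=10, n_clusters_per_class=1, random_state=0)

# 75:25 train/test split, stratified so that every class appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# One-versus-all SVM with an RBF kernel (hyperparameters are assumed defaults).
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))
```

`OneVsRestClassifier` fits one binary SVM per class, which mirrors the one-versus-all scheme described in the “Support vector machine” section; the printed report lists precision, recall, and F1 score per class.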

Results

Results obtained from the various datasets are compiled in Tables 5, 6, 7, 8, 9 and 10. The confusion matrices were estimated from the classification reports given in Tables 5, 6, 7, 8, 9 and 10, which makes the model’s performance easier to interpret. The consolidated results are compiled in Table 11. The model achieved the highest recognition accuracy of 99.72% with the proposed fusion-based feature scheme (dataset D5).

Table 5 Result obtained from input dataset (images)
Table 6 Result obtained from dataset D1 (VGG-16Net)
Table 7 Result obtained from dataset D2 (VGG-19Net)
Table 8 Result obtained from dataset D3 (ResNet-50)
Table 9 Result obtained from dataset D4 (Inception-v3)
Table 10 Result obtained from dataset D5 (proposed fusion-based features)
Table 11 Consolidated results obtained from various datasets used in the proposed study
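For readers who want to relate a confusion matrix to the per-class figures in such tables, the sketch below computes precision, recall, F1 score, and overall accuracy from a small matrix. It uses the standard definitions in terms of TP, FP, and FN, which are assumed to match the formulations of Table 4; the 3 × 3 matrix is invented for illustration and is not taken from the study.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j.

    Assumes every class is predicted at least once and has at least one
    true sample, so that no denominator is zero.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correct predictions per class
    fp = cm.sum(axis=0) - tp         # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp         # belonging to the class but predicted elsewhere
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()   # overall fraction of correct predictions
    return precision, recall, f1, accuracy

# Hypothetical 3-class confusion matrix with 50 test samples per class.
cm = [[48, 2, 0],
      [3, 45, 2],
      [0, 1, 49]]
precision, recall, f1, accuracy = per_class_metrics(cm)
print(np.round(precision, 4), np.round(recall, 4), np.round(f1, 4), round(accuracy, 4))
```

Here 142 of the 150 samples lie on the diagonal, so the overall accuracy is 142/150, and the recall of each class is its diagonal entry divided by 50.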

Feature separation

The separation between the features related to various numeral classes in the input dataset and in the proposed feature scheme (dataset D5) was visualized by using the t-SNE (t-distributed stochastic neighbor embedding) algorithm. It is a popular dimensionality reduction algorithm used for visualizing high-dimensional data in a low-dimensional space while preserving the structure of the original data as much as possible. With the help of a Gaussian kernel, t-SNE computes a similarity score for each data point with every other data point based on Euclidean distance. The similarity scores are used to compute probability distributions for both the high-dimensional and low-dimensional spaces. The goal of t-SNE is to minimize the divergence between the probability distributions in the high-dimensional and low-dimensional spaces. The algorithm does this by adjusting the positions of the data points in the low-dimensional space so that the probability distributions match as closely as possible.

A significant separation between the features related to different numeral classes can be observed in Fig. 6b, which was derived from the proposed feature scheme, in comparison to Fig. 6a, which was derived from the raw images of the input dataset. The greater the separation between the features, the easier their classification.
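A visualization of this kind can be produced along the following lines with scikit-learn. The sketch is only indicative: the clustered synthetic data stand in for the 80-dimensional fused features, and the settings `perplexity=30.0`, `init="pca"`, and `random_state=0` are assumptions for this example rather than the settings used for Fig. 6.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for fused features: 10 classes, 30 samples each, 80 dimensions.
rng = np.random.default_rng(0)
n_per_class, n_classes, d = 30, 10, 80
centers = rng.normal(scale=5.0, size=(n_classes, d))
X = np.vstack([c + rng.normal(size=(n_per_class, d)) for c in centers])
labels = np.repeat(np.arange(n_classes), n_per_class)

# Embed into two dimensions; the perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=0).fit_transform(X)
print(embedding.shape)  # (300, 2)
```

The two columns of `embedding` can then be drawn as a scatter plot colored by `labels` to inspect how well the classes separate.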

The results of various benchmark models, along with those of the proposed one, are compiled in Table 12. It should be noted that there is no standard dataset of handwritten Hindi numerals in the public domain, and the results of the benchmark models mentioned in Table 12 were based on different datasets.

Table 12 Results of benchmark models along with proposed one

The proposed model produced a recognition rate comparable to that of the benchmark models, with a smaller feature vector and a higher number of test samples. The smaller the feature vector, the lower the training and classification complexities (refer to the “Support vector machine” section).

Discussion

Figure 5 demonstrates the efficiency of the proposed scheme. The confusion matrix in Fig. 5a was derived when the classifier was tested with the input dataset (i.e., numeral images directly). A higher degree of confusion can be observed between numeral classes 2–3, 4–5, and 6–7; also, a significant count of false-negative (FN) predictions was recorded for numeral classes 5 and 7. All these regions of the confusion matrix are encircled in red. The confusion matrices in Fig. 5b to e were derived when the classifier was tested with datasets D1 to D4. Clear improvements can be observed in the encircled regions of the respective matrices with respect to Fig. 5a. This is also reflected in the recognition accuracy achieved with these datasets (refer to Table 11). The confusion matrix in Fig. 5f shows marked improvements over Fig. 5a and the rest. This matrix was derived by testing the classifier with the proposed fusion-based feature scheme (dataset D5). The matrix has minimal confusion. This suggests the potential of the proposed scheme in selecting prominent features related to different numeral classes that can help the given machine learning algorithm recognize them precisely.

Fig. 5 Confusion matrix related to a input dataset, b dataset D1, c dataset D2, d dataset D3, e dataset D4, and f dataset D5 (proposed dataset)

Figure 6 demonstrates the effectiveness of the proposed scheme in selecting distinct features related to various numeral classes. The proposed feature optimization resulted in a good separation between the features related to various numeral classes in the feature space, which contributed to achieving a recognition rate comparable to that of the benchmark models.

Fig. 6 Separation between the features related to various numeral classes in a input dataset and b proposed feature scheme (dataset D5)

Referring to Table 12, the proposed model achieved recognition accuracy comparable to that of the benchmark models while using fewer features, which suggests its potential to solve the given problem at a low classification cost.

Conclusions

Most of the benchmark models relied on either a machine learning or a deep learning approach. The former is simpler and more interpretable; it can be trained with small datasets and fewer parameters, but the need for manual feature engineering limits its performance. Deep learning methods, on the other hand, can generate the salient features automatically, but they need large datasets and millions of trainable parameters to produce excellent results. The proposed study presented an effective ensemble of these state-of-the-art approaches. The benchmark DCNN models VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 were employed as feature extractors that produced large feature vectors. The size of the feature vectors was optimized by careful implementation of the classical PCA method, which led to a low-classification-cost solution to the proposed problem. The optimized features were fused together in a systematic manner and used to train the benchmark SVM classifier. The proposed model achieved results comparable to those of the benchmark models with a smaller feature vector. The smaller the feature vector, the lower the training and classification complexities. Although medical imaging and related pattern recognition problems are not within the scope of the current study, we are hopeful that the proposed fusion-based feature scheme will also be helpful in solving these kinds of problems effectively.