
1 Introduction

Computer Vision (CV) can be formally defined as a field of study that seeks to develop techniques to help computers visualize and understand the content of digital images such as photographs and videos. From the biological viewpoint, it aims to develop computational models of the human visual system; from the engineering viewpoint, it seeks to establish autonomous systems that perform visual tasks as a human would. Thus, CV has numerous applications across engineering and the medical sciences [1]. It finds application in the automotive, manufacturing, and retail industries (e.g., Walmart and Amazon Go), financial services, health care, agriculture, surveillance, robot navigation, autonomous driving, sign translation, etc. Researchers are also developing autonomous systems to automatically extract information from old documents and produce digitized versions of such records. One of the most important uses of computer vision is to extract the text regions [2] from natural scene images and born-digital images, which in turn assists language and sign translation and tourist navigation. Thus, with such a vast domain of applications, CV plays an essential role in improving the quality of human life.

1.1 Natural Scene Images

Natural scene images [3] are images captured with cameras or other handheld devices under purely natural conditions; they may be incidental or non-incidental images. They include images of advertisement boards, billboards, notices, and signboards of shops, hotels, and other public offices and buildings. Such images often contain both text and non-text components, and the text carries essential information about the scene. This information can be used in applications such as tourist navigation and driver assistance. Figure 1 displays samples from the natural scene image datasets available for research, such as ICDAR 2003 [4], ICDAR 2011 [5], and ICDAR 2013 [6]; research in this domain is carried out with the help of these datasets.

Fig. 1. Examples of natural scene images [5]

Natural scene images contain various types of text, as shown in Fig. 1. The font may be fancy or regular, and the text may appear in different orientations, colors, and languages. In this paper, we focus on the ICDAR datasets, which mainly contain English text. Apart from font variation, the significant hurdle [7] in extracting text regions is the presence of non-text elements in the images: natural scenery such as trees and plants, and objects such as chairs, tables, and fencing. These non-text elements must be removed to obtain the proper text regions for information extraction. This requires classifying text and non-text components in the scene images, which is the main aim of this paper.

1.2 Classification in Machine Learning

Machine learning is a branch of Artificial Intelligence (AI), widely used in Computer Vision (CV), that uses data and algorithms to imitate the way humans learn, gradually improving its accuracy. In other words, machine learning uses computer programs and data from which the machine learns. The aim is to make the computer or machine learn by itself; the learning process requires observations or data, which are available from various internet sources for a given problem.

The learning process requires classification among the different types of samples available for a given problem; classification deals with assigning labels to different objects or samples. The classification process requires training on datasets, and the results are evaluated on given testing sets. For this work, it is necessary to build different machine learning classification models. These models are based on supervised or unsupervised machine learning algorithms, and the trained models can then be used for testing purposes.

The present paper aims to build different machine learning models [8] to classify the text and non-text elements in natural scene images. The models are evaluated based on the confusion matrix obtained and the overall accuracy. The rest of the paper is organized as follows: Sect. 1 gives the basic introduction, Sect. 2 covers the literature review related to the problem, Sect. 3 demonstrates the proposed methodology with experiments, Sect. 4 discusses the results, and Sect. 5 presents the conclusion and future work.

2 Literature Review

The importance of applications such as content-based image retrieval, license plate recognition, language translation from scenes, and word detection from document images encourages researchers to work on text detection and recognition from scene images. Several categories [9] of methods have been explored in the past, such as region-based, texture-based, connected-component-based, and stroke-based methods. All of them have one thing in common: text-specific features are required to classify the text and non-text elements present in the image. Thus, to identify text and non-text elements correctly, one of the important tasks is the choice of the classifier that will give maximum accuracy for the selected features.

The classification of text and non-text elements is one of the crucial steps in text detection from scene images. Researchers have used different features and classifiers for this purpose. Iqbal et al. [10] propose using four classifiers, AdaBoost M1, Bayesian Logistic Regression, Naïve Bayes, and Bayes Net, to classify text and non-text components; however, their sample space consisted of only 25 images. Zhu et al. [11] use a two-stage classification process to separate text and non-text elements, which increases time complexity. Lee et al. [12] and Chen and Yuille [13] discuss the utility of AdaBoost classifiers, but the selection of inappropriate features gives less efficient results. Pan et al. [14] propose implementing a boosted classifier and a polynomial classifier to separate text and non-text components. Ma et al. [15] use a linear SVM with LBP, HOG, and statistical features. Pan et al. [16] use a CRF with single-perceptron and multi-layer perceptron classifiers. Maruyama et al. [17] propose implementing the classification using an SVM (RBF kernel) with a stump classifier in the second stage. Fabrizio et al. [14] use K-NN in the first stage and an SVM with RBF kernel in the second stage. Ansari et al. [18] propose a method for classifying components with the assistance of T-HOG and LBP features and an SVM classifier; its drawback is the high computation cost.

No method for selecting the classifiers is discussed in the previous work done by researchers in this domain; most of the work uses SVM and AdaBoost classifiers, and the classifiers are chosen arbitrarily. Some methods use two-stage classification, which increases the computation cost. The method in [19] uses SVM classifiers and thus takes a long time due to detailed segmentation. In some of the previous works [20], the inclusion of deep learning architectures for classification increases the computation time to a great extent.

Moreover, such architectures require a significant amount of time to train before they give accurate results. The choice of a suitable classifier is one of the critical tasks in classification using machine learning algorithms, as it increases the accuracy of the results and reduces the time taken to produce them. Therefore, a classifier that gives high accuracy for the classification of text and non-text elements in natural scene images is required.

3 Proposed Methodology

This section introduces the proposed methodology for building the machine learning models used in this paper to classify text and non-text elements. The benchmark ICDAR 2013 dataset is used for this purpose. The images from the ICDAR dataset undergo the modified WMF-MSER method to separate the connected characters and text present in the images; the classification is then performed using the ground truth available for the images. The flowchart for the proposed method is shown in Fig. 2.

Fig. 2. Flowchart for the proposed methodology

3.1 Introduction to MSER & WMF-MSER

Maximally Stable Extremal Regions (MSER) is one of the most widely used blob detection techniques in Computer Vision. It was developed by Matas et al. [22] and has since been used extensively in text region detection. The original motivation of the method is to detect corresponding regions between two views of the same scene taken from different angles. MSERs remain stable across a range of intensity thresholds and may be darker or brighter than their surroundings: the pixels in an extremal region have either higher or lower intensity than those on its boundary. The method therefore identifies areas with considerable intensity variation in the given images. The text present in natural scene images differs in intensity (higher or lower) from its background, which is exactly the property the human eye uses to perceive text. Since MSER works on the principle of intensity variation, it motivates us to use the MSER method for separating interconnected text or characters.
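The stability principle behind MSER can be illustrated with a small Python sketch (an illustration only, not the detector implementation of [22]): for a seed pixel, we track the area of its dark connected component as the intensity threshold sweeps upward, and score each threshold by the relative area change; maximally stable regions correspond to local minima of this score.

```python
from collections import deque

def region_area(img, seed, t):
    """Area of the dark connected component (pixels <= t) containing seed.

    img is a 2D list of grayscale values in 0..255; seed is a (row, col) pair.
    """
    h, w = len(img), len(img[0])
    if img[seed[0]][seed[1]] > t:
        return 0
    seen, q = {seed}, deque([seed])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in seen \
                    and img[nr][nc] <= t:
                seen.add((nr, nc))
                q.append((nr, nc))
    return len(seen)

def stability(img, seed, delta=5):
    """Relative area growth of the seed's region at each threshold.

    The MSER criterion keeps regions at thresholds where this score
    has a local minimum (the region barely changes as t varies).
    """
    scores = {}
    for t in range(delta, 256 - delta):
        area = region_area(img, seed, t)
        if area == 0:
            continue
        growth = region_area(img, seed, t + delta) - region_area(img, seed, t - delta)
        scores[t] = growth / area
    return scores
```

A dark letter on a bright background yields a near-zero stability score over a wide band of thresholds, which is why text components are reliably picked up as MSERs.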

Fig. 3. WMF-MSER [21]: a) original image, b) original MSER [22], c) WMF-MSER

In our previous work [21], we use the WMF-MSER algorithm for separating interconnected characters. The results obtained by the WMF-MSER algorithm are shown in Fig. 3: the resultant images in Fig. 3(c) have properly separated characters compared to the original MSER output in Fig. 3(b). Thus, the main advantage of using WMF-MSER is that features can be extracted accurately from these properly separated text elements. The extracted features are then used for building the classification models using machine learning algorithms. The next section discusses the features used in the paper.

3.2 Extraction of the Features

The text elements present in the images show significant variations among themselves, and the non-text elements differ from the text elements. The naked human eye can identify text quickly because humans have complete knowledge of the alphabet and text of their native language; machines, however, cannot recognize such text or characters until they are trained to do so. The training process requires features that properly distinguish the two entities. Hence, in this domain, it is essential to have appropriate, mutually exclusive features for differentiating between text and non-text elements.

Fig. 4. Examples of text elements [23]

Fig. 5. Examples of non-text elements [23]

Figures 4 and 5 display a few examples of the text and non-text elements obtained after applying WMF-MSER. Researchers have, over the years, extracted many features for this task. In this paper, we choose three features: Maximum Stroke Width Ratio, Color Variation, and Solidity. The text elements present in the images have different sizes, colors, shapes, and orientations, so we consider three mutually exclusive features to differentiate between text and non-text elements properly. The features are defined as follows:

  1. a)

    Maximum Stroke Width Ratio (MSWR): The stroke width [24] of text is one of its unique features. The stroke width of text remains uniform, making it one of the prominent features for distinguishing text from non-text elements. Non-text elements do not have a uniform stroke width due to their irregular structure, so the stroke widths obtained for non-text elements show far more variation than those of text elements. It is evident from Figs. 4 and 5 that the text elements have uniform stroke width, whereas the non-text elements do not. MSWR can therefore be chosen as one of the features for separating text and non-text elements.

  2. b)

    Color Variation: Color is one of the essential traits of any element and assists in differentiating objects. The text present in the images possesses colors different from those of the non-text elements, and the background around the text also helps in identifying it correctly. Therefore, the variation in color is taken as one of the features for classification. The color variation is calculated by the Jensen-Shannon divergence (JSD) [25], which measures the difference between the color probability distributions of the text and its background.

  3. c)

    Solidity: The text elements in the images have a very uniform structure, whereas the non-text elements have a non-uniform structure. Therefore, to differentiate the elements at the structural level, we choose solidity as the third feature. It is the ratio of the area covered by the pixels in a region R to the area of the (smallest) convex hull surrounding that region.

Thus, we consider the three features mentioned above to build the classification models. These features are mutually exclusive, which is essential because we must consider different aspects of the text to discriminate it from non-text elements; each distinct feature helps identify the text more accurately. MSWR captures the uniformity of the stroke width, color variation captures the differing backgrounds of the elements (text and non-text), and solidity captures the uniformity of the area occupied by the elements. In the next section, the machine learning classification models are built using the training dataset of ICDAR 2013.
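As an illustration, the three features can be computed as in the following Python sketch. The paper does not give a closed-form definition of MSWR, so the ratio used below (maximum stroke width over mean stroke width) is an assumed formulation; the JSD and solidity functions follow their standard definitions.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete colour
    distributions, e.g. of a component and its surrounding background."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def solidity(region_area, hull_area):
    """Ratio of the pixel area of region R to the area of its convex hull;
    close to 1 for solid, uniform shapes such as character strokes."""
    return region_area / hull_area

def max_stroke_width_ratio(widths):
    """Assumed MSWR formulation: maximum stroke width over the mean stroke
    width; 1.0 for perfectly uniform strokes, larger for irregular shapes."""
    return max(widths) / (sum(widths) / len(widths))
```

For identical colour distributions the JSD is 0, and for disjoint distributions it reaches its base-2 maximum of 1, so the feature is conveniently bounded in [0, 1].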

3.3 Building Classification Models

Classification in machine learning predicts the class label for a given set of input data. A classification model assigns a label to an object based on the input values provided for training and the machine learning algorithm used. Classification problems are either binary or multi-class: binary classification assigns one of two given classes, whereas multi-class classification assigns one of many. In this paper, we have a binary classification problem in which the classification algorithm labels each element as text or non-text, based on the features extracted in the previous section. We have chosen four classifiers, and the experiments are performed using the MATLAB [26] Classification Learner application. The dataset used for training and building the classification models is the ICDAR 2013 dataset, which consists of 229 natural scene images containing 4786 text characters. Applying the WMF-MSER algorithm yielded 4549 non-text elements. We then calculated the three features on both the text and non-text elements, as mentioned in Sect. 3.2. The four classifiers chosen for building the classification models are Bagged Trees [27], Fine Trees [28], K-Nearest Neighbor [29], and Naïve Bayes [30]. Since an element present in the images can be either text or non-text, the following parameters are used for classification:

  1. a)

    True Positives (TP): Text is discovered as text.

  2. b)

    True Negative (TN): Non-text is discovered as non-text.

  3. c)

    False Positive (FP): Non-text is discovered as text.

  4. d)

    False Negative (FN): Text is discovered as non-text.

Therefore, the overall accuracy (A) of the classifiers is computed as

$$\mathrm{Accuracy }(\mathrm{A})= \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$

The accuracy calculated by this equation is used as the final measure of the overall performance of the classifiers.
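The four counts and the accuracy formula above can be computed directly from predicted and ground-truth labels, as in this short sketch:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN, treating `positive` (text = 1) as the
    positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Overall accuracy A = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```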

4 Experiments and Results

This section discusses the experimental setup and the results obtained. The three features calculated on both the text (4786) and non-text (4549) elements are combined to make a feature vector (FV). Since there are two classes, text (1) and non-text (0), the class or response vector (R) consists of the two values 1 and 0. The feature vector and class vector are given as

$$FV= \{MSWR, CV, S\}$$
$$R=\{\mathrm{0,1}\}$$

For building the classification models, we use the MATLAB Classification Learner application. This application, part of MATLAB, trains models to classify data and provides many classifiers based on supervised machine learning algorithms. Data can be explored, trained, validated, and assessed using this application, which is easy to use and gives accurate results. The detailed experimental setup is displayed in Table 1.

Table 1. Experimental details for building classification models

10-fold cross-validation is used in the experiments to obtain a reliable estimate of accuracy. The feature vector is passed as input to the four classifiers mentioned in Sect. 3.3, and the accuracy of each classifier is obtained.
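For reference, 10-fold cross-validation partitions the samples into ten folds, each serving once as the validation set while the remaining nine train the model. A minimal index-level sketch (the actual splitting is handled internally by the MATLAB application):

```python
def k_fold_indices(n, k=10):
    """Split sample indices 0..n-1 into k folds and return (train, val)
    index lists, one pair per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]                                        # held-out fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, val))
    return splits
```

The reported accuracy is then the average over the ten validation folds, so every sample is used for validation exactly once.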

The results obtained are displayed in Table 2, which shows that the highest accuracy is obtained for the Bagged Trees classifier. Bagging is an entirely data-specific algorithm: the bagging technique reduces the possibility of over-fitting, performs well on high-dimensional data, and is robust to missing values in the dataset. Bagged trees combine the outputs of many weak learners to match or outperform the performance of a strong learner.
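The bagging idea described above, bootstrap resampling plus majority voting over weak learners, can be sketched as follows. The `learn` interface is a toy stand-in for illustration, not MATLAB's Bagged Trees implementation:

```python
import random

def bagged_predict(xs_train, ys_train, learn, xs_test, n_learners=25, seed=0):
    """Bagging: fit each weak learner on a bootstrap resample of the
    training data, then combine predictions by majority vote."""
    rng = random.Random(seed)
    n = len(xs_train)
    models = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        models.append(learn([xs_train[i] for i in idx],
                            [ys_train[i] for i in idx]))
    preds = []
    for x in xs_test:
        votes = [m(x) for m in models]                        # majority vote
        preds.append(max(set(votes), key=votes.count))
    return preds
```

Because each learner sees a different resample, their individual errors tend to be uncorrelated, and the majority vote averages them out.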

Therefore, owing to the advantages mentioned above, the accuracy obtained from Bagged Trees is the highest for the feature vector consisting of the three features. The confusion matrix, which consists of TP, TN, FP, and FN, is used to plot the ROC curves for the classifiers, shown in Figs. 6, 7, 8 and 9. The area under the ROC curve is a further indicator of the best classifier (Table 2).

Table 2. Classification accuracy obtained for four classifiers
Fig. 6. ROC curve for Bagged Trees

Fig. 7. ROC curve for Fine Tree

Fig. 8. ROC curve for KNN

Fig. 9. ROC curve for Naïve Bayes

The area under the ROC curve is largest for Bagged Trees, indicating that bagged trees are the best classifier among the four chosen classifiers.
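An ROC curve plots the true-positive rate against the false-positive rate as the decision threshold sweeps over the classifier scores; the area under it is what the comparison above summarizes. A minimal sketch:

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a decision threshold over
    the classifier scores; labels are 1 (text) and 0 (non-text)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts
```

A classifier that perfectly separates the two classes passes through the point (FPR = 0, TPR = 1), giving an area under the curve of 1.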

The choice of classifier is an essential step in classifying text and non-text elements, since many classifiers exist in the domain of machine learning. Previous researchers either made an arbitrary choice of classifier or followed the traditional approach of using SVM/AdaBoost classifiers. We contribute by performing the classifier selection with the help of the MATLAB Classification Learner application, which has not been well explored for the classification of text and non-text elements.

In comparison with other state-of-the-art methods, Iqbal et al. [10] considered 25 images of the ICDAR 2011 dataset for their experiments, whereas we use 229 images for choosing the classifier. The images are of very different types, which helps build a more accurate training model for handling different testing sets.

The method in [31] applies a CNN for classification and thus requires high computation time for training compared to the proposed method using traditional classifiers. Mukhopadhyay et al. [32] used 100 images with a one-class classifier and obtained 71% accuracy, whereas we obtain 83% in our work.

Methods using deep learning achieve higher accuracy, but at a high computation cost: an extensive training set [33] is required for the training process, and although such methods can detect different text patterns [34, 35] in images, the need for a GPU framework [36] increases the cost. We therefore choose to work with traditional machine learning classifiers and achieve results with small training sets.

5 Conclusion

The present paper demonstrates the work done to build a classification model for the text and non-text elements present in natural scene images. The classification of text and non-text elements is the preliminary step in detecting and extracting text regions. The paper explores the use of existing machine learning algorithms to build the classification models; the reason behind this approach is the simplicity of the models and the ability to perform experiments with less time and training data. The features used in the paper are mutually exclusive, so they contribute to identifying text and non-text correctly. The ICDAR 2013 dataset is used because it provides proper ground truth for experimental purposes. Future work includes using the Weka tool, other relevant edge-smoothing filters, and deep learning tools for classification with new, innovative text-specific features.