Abstract
Computer vision aims to build autonomous systems that can perform some of the tasks of the human visual system (and even surpass it in many cases). Among the many applications of computer vision, extracting information from natural scene images is prominent and influential. The information gained from an image can serve identification, space measurement for navigation, or augmented reality applications. These scene images contain relevant text elements as well as many non-text elements. Before meaningful information can be extracted from the text, the foremost task is to classify the text and non-text elements correctly in the given images. The present paper builds machine learning models for accurately classifying the text and non-text elements in the benchmark ICDAR 2013 dataset. The results are reported in terms of the confusion matrix to determine the overall accuracy of the different machine learning models.
1 Introduction
Computer Vision, often abbreviated as CV, can be formally defined as a field of study that seeks to develop techniques to help computers visualize and understand the content of digital images such as photographs and videos. From a biological viewpoint, it aims to develop computational models of the human visual system; from an engineering viewpoint, it seeks to establish autonomous systems that perform tasks a human can. Thus, Computer Vision has numerous applications across engineering and the medical sciences [1]. It finds application in the automotive, manufacturing, and retail industries (e.g., Walmart and Amazon Go), financial services, health care, agriculture, surveillance, robot navigation, autonomous driving, sign translation, and more. Researchers are also developing autonomous systems to extract information from old documents automatically and help form digitized versions of such records. One of the most important uses of computer vision is to extract the text regions [2] from natural scene images and born-digital images, which further assists language and sign translation and tourist navigation. With such a vast domain of applications, CV plays an essential role in improving the quality of human life.
1.1 Natural Scene Images
Natural scene images [3] are images captured with cameras or other handheld devices under purely natural conditions. These images may be incidental or non-incidental. They include images of advertisement boards, billboards, notices, and various boards from shops, hotels, and other public offices and buildings. Such images often contain non-text as well as text components. The text present in such images carries essential information about those images. Such data can be used for applications like tourist navigation, driving assistance, etc. Figure 1 displays samples from the several natural scene image datasets, such as ICDAR 2003 [4], ICDAR 2011 [5], and ICDAR 2013 [6], available for research. Research in this domain is carried out with the help of these datasets.
Examples of natural scene images [5]
Natural scene images contain various types of text, as shown in Fig. 1. The font of the text can be fancy or regular, and the text may appear in different orientations, colors, and languages. In this paper, we focus on the ICDAR datasets, which mainly contain English text. The significant hurdles [7] in extracting text regions, apart from variation in font, are the other non-text elements present in the images. The images contain various other details apart from the text regions: natural scenery like trees and plants, and objects like chairs, tables, fencing, etc. These non-text elements must be removed from the images to obtain the proper text regions for extracting information from the text. This requires classifying the text and non-text elements in the scene images, which is the paper's main aim.
1.2 Classification in Machine Learning
Machine learning is a branch of Artificial Intelligence (AI), closely tied to Computer Vision (CV), that uses data and algorithms to imitate the way humans learn, gradually improving its accuracy. Machine learning therefore relies on computer programs and on data from which they can learn. The aim is to make the computer or given machine learn by itself. The learning process requires observations or data, which are available from various internet sources for a given problem.
The learning process requires classification among the different types of samples available for a given problem. Thus, classification deals with assigning labels to different objects or samples. The classification process requires training on datasets, and the results are evaluated on the given testing sets. For this work, it is necessary to build different classification machine learning models. These models are based on supervised or unsupervised machine learning algorithms; once trained, they can be further used for testing purposes.
The present paper aims to build different machine learning models [8] to classify the text and non-text elements in natural scene images. The models are evaluated based on the confusion matrix obtained and the overall accuracy. The rest of the paper is organized as follows: Sect. 1 gives the basic introduction, Sect. 2 covers the literature review related to the problem, Sect. 3 demonstrates the proposed methodology with experiments, Sect. 4 discusses the results, and Sect. 5 presents the conclusion and future work.
2 Literature Review
The importance of applications like content-based image retrieval, license plate recognition, language translation from scenes, and word detection from document images encourages researchers to work on text detection and recognition from scene images. Several categories [9] of methods have been explored in the past, such as region-based, texture-based, connected-component-based, and stroke-based methods. These methods have one thing in common: text-specific features are required to classify the text and non-text elements present in the image. Thus, to identify the text and non-text elements correctly, one of the important tasks is the choice of a classifier that will give maximum accuracy with the selected features.
The classification of text and non-text elements is one of the crucial steps in text detection from scene images. Researchers have used different features and classifiers for this purpose using machine learning algorithms. Iqbal et al. [10] propose using four classifiers, AdaBoost M1, Bayesian logistic regression, Naïve Bayes, and Bayes Net, to classify text and non-text components; however, their sample space consisted of only 25 images. Zhu et al. [11] use a two-stage classification process to separate text and non-text elements, which increases the time complexity. Lee et al. [12] and Chen and Yuille [13] discuss the utility of AdaBoost classifiers, but the selection of inappropriate features gives less efficient results. Pan et al. [14] propose implementing a boosted classifier and a polynomial classifier to separate the text and non-text components. Ma et al. [15] use a linear SVM with LBP, HOG, and statistical features. Pan et al. [16] use a CRF with single-perceptron and multi-layer-perceptron classifiers. Maruyama et al. [17] propose implementing the classification using an SVM (RBF kernel) with a stump classifier in the second stage. Fabrizio et al. [14] use K-NN in the first stage and an SVM with an RBF kernel in the second stage. Ansari et al. [18] propose a method for classifying components with the assistance of T-HOG and LBP features with an SVM classifier; its drawback is the high computation cost.
No method for selecting the classifiers is described in the previous work; most of it relies on SVM and AdaBoost classifiers, chosen arbitrarily. Some of the methods use two-stage classification, which increases the computation cost. The method in [19] uses SVM classifiers and thus takes a long time due to detailed segmentation. In some of the previous works [20], the inclusion of deep learning architectures for classification increases the computation time to a great extent.
Moreover, such models require a significant amount of time to train before giving accurate results. The choice of a suitable classifier is one of the critical tasks in classification using machine learning algorithms: it increases the accuracy of the results and reduces the time taken to produce them. Therefore, a classifier that gives high accuracy for the classification of text and non-text elements in natural scene images is required.
3 Proposed Methodology
This section introduces the proposed methodology for building the machine learning models used in the paper to classify the text and non-text elements. The benchmark ICDAR 2013 dataset is used for this purpose. The images from the ICDAR dataset undergo the modified WMF-MSER method to separate the connected characters and text present in the images. The classification is then performed using the ground truth available for the images. The flowchart for the proposed method is shown in Fig. 2.
3.1 Introduction to MSER & WMF-MSER
The domain of Computer Vision involves one of the most widely used techniques for blob detection, termed Maximally Stable Extremal Regions (MSERs). It was developed by Matas et al. [22] and is used extensively in text region detection. The method was originally designed to find correspondences between two images of the same scene viewed from different angles. MSERs remain stable across a range of thresholds and may be darker or brighter than their surrounding areas: the pixels inside an extremal region have either higher or lower intensity than those on its boundary. The method therefore helps identify areas with considerable intensity variation in the given images. The text present in natural scene images has a different intensity (higher or lower) from the background, which is also what makes it conspicuous to the human eye. Since MSER works on the principle of intensity variation, it motivates us to use the MSER method for separating interconnected text or characters.
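The stability criterion can be illustrated with a small Python sketch. This is an illustration of the principle only, not the full MSER algorithm (which grows connected components over all thresholds); the toy image values and the delta parameter are assumptions made for the example.

```python
def thresholded_area(img, t):
    # pixels darker than t form the candidate "extremal" region
    # (dark text on a light background)
    return sum(1 for row in img for v in row if v < t)

def most_stable_threshold(img, thresholds, delta=2):
    # MSER's stability criterion: a region is maximally stable where its
    # area changes least as the threshold moves by +/- delta
    areas = {t: thresholded_area(img, t) for t in thresholds}
    def stability(t):
        lo, hi = t - delta, t + delta
        if lo not in areas or hi not in areas or areas[t] == 0:
            return float("inf")
        return (areas[hi] - areas[lo]) / areas[t]
    return min(thresholds, key=stability)

# a 5x5 toy image: a dark 3x3 "glyph" (intensity 10) on a light
# background (intensity 200)
toy = [[200] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        toy[r][c] = 10

best_t = most_stable_threshold(toy, list(range(2, 256, 2)))
```

At the most stable threshold, the thresholded region is exactly the glyph: its area stays constant over the whole band of thresholds between the glyph and background intensities.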
From our previous work [21], we use the WMF-MSER algorithm for separating interconnected characters. The results obtained by the WMF-MSER algorithm are shown in Fig. 3: the resultant images in Fig. 3(c) have properly separated characters compared to the original images in Fig. 3(b). The main advantage of using WMF-MSER is that features can then be extracted accurately from these properly separated text elements. The extracted features are used for building the classification model using machine learning algorithms. The next section discusses the features used in the paper.
3.2 Extraction of the Features
The text elements present in the images vary significantly among themselves, and the non-text elements differ from the text elements. The naked human eye can identify this quickly, as we humans have complete knowledge of the alphabet and text used in our native language. Machines, however, cannot recognize such text or characters until they are trained to do so. The training process requires features that properly distinguish the two kinds of entities. In the same way, in this domain it is essential to have appropriate, mutually exclusive features for differentiating between the text and non-text elements.
Example of text elements [23]
Examples of non-text Elements [23]
Figures 4 and 5 display a few examples of the text and the non-text elements obtained after applying WMF-MSER. Over the years, researchers have extracted many features for this task. In this paper, we choose three features: Maximum Stroke Width Ratio, Color Variation, and Solidity. The text elements present in the images have different sizes, colors, shapes, and orientations, so we consider three mutually exclusive features to differentiate between text and non-text elements properly. The definitions of the features are as follows:
a) Maximum Stroke Width Ratio (MSWR): The stroke width [24] of any text is one of its unique features. The stroke width of text remains largely uniform, and thus it is one of the prominent features for distinguishing text from non-text elements. Non-text elements do not have uniform stroke width due to their irregular structure, so the stroke width obtained for non-text elements varies much more than for text elements. It is evident from Figs. 4 and 5 that the text elements have uniform stroke width, whereas non-text elements do not. So MSWR can be chosen as one of the features for separating non-text and text elements.
b) Color Variation: Color is one of the essential traits of any element that assists in differentiating objects. The text present in the images possesses different colors from the non-text elements, and the background around the text also helps in identifying the text correctly. Therefore, the variation in color is taken as one of the features for classification. The color variation is calculated by the Jensen-Shannon divergence (JSD) [25], which measures the difference between the color probability distributions of the text and its background.
c) Solidity: The text elements in the images have a very uniform structure, whereas the non-text elements have a non-uniform structure. Therefore, to differentiate the elements at the structural level, we choose solidity as the third feature in our work. It is the ratio of the area covered by the total pixels in a region R to the area of the convex (smallest) hull surrounding that region.
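As a concrete illustration of the color-variation feature, the Jensen-Shannon divergence between two discrete color distributions (e.g., normalized color histograms of a candidate region and of its background) can be computed as below; the example histograms are hypothetical.

```python
from math import log

def kl_divergence(p, q):
    # Kullback-Leibler divergence; zero-probability terms contribute nothing
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    # JSD is the average KL divergence of p and q from their midpoint m;
    # it is symmetric and bounded above by ln 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

text_hist = [0.7, 0.2, 0.1]        # hypothetical color histogram of a text region
background_hist = [0.1, 0.2, 0.7]  # hypothetical histogram of its background
variation = jensen_shannon(text_hist, background_hist)
```

A large value indicates that the region's colors differ strongly from the background, which is the behavior expected of text.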
Thus, we consider these three features to build the classification models. The three features are mutually exclusive of each other; this condition is essential because we must consider different aspects of the text to discriminate it from non-text elements, and it helps identify the text more accurately as each feature is distinct. MSWR relates to the uniformity of the stroke width, the color variation captures the different backgrounds of the elements (text and non-text), and the solidity feature contributes a difference based on the uniformity of the area occupied by the elements. In the next section, the machine learning classification models are built using the training dataset of ICDAR 2013.
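The solidity feature can be sketched as follows. As a simplification, the hull is taken over the pixel centres, so absolute values differ from implementations that use the full pixel footprint (e.g., regionprops-style solidity), but the ordering between compact, text-like components and ragged, non-text-like ones is preserved.

```python
def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in counter-clockwise order
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(verts):
    # shoelace formula
    n = len(verts)
    s = sum(verts[i][0] * verts[(i + 1) % n][1] - verts[(i + 1) % n][0] * verts[i][1]
            for i in range(n))
    return abs(s) / 2

def solidity(pixels):
    # pixel count of region R over the area of its convex hull
    hull_area = polygon_area(convex_hull(pixels))
    return len(pixels) / hull_area if hull_area else 1.0

square = [(x, y) for x in range(5) for y in range(5)]      # compact, text-like blob
l_shape = [(x, y) for (x, y) in square if x < 2 or y < 2]  # ragged, non-text-like blob
```

For the toy shapes above, the compact square scores higher solidity than the L-shaped component, which is the discriminative behavior the feature relies on.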
3.3 Building Classification Models
Machine learning includes classification, which predicts the class label for a given set of input data. A classification model assigns a label to an object based on the input values given for training and the machine learning algorithm used. Classification problems are either binary or multi-class: binary classification labels one out of two given classes, whereas multi-class classification labels one out of many. In this paper, we have a binary classification problem, in which the label text or non-text is assigned by the classification algorithm. The classification is performed based on the features extracted in the previous section. We have chosen four classifiers for this purpose, and experiments are performed using the MATLAB [26] Classification Learner application. The dataset used for training and building the classification model is the ICDAR 2013 dataset. It consists of 229 natural scene images containing 4786 text characters. We applied the WMF-MSER algorithm and obtained 4549 non-text elements. After that, we calculated the three features on both the text and non-text elements, as mentioned in Sect. 3.2. The four classifiers chosen for building the classification model using the dataset and the three features are Bagged Trees [27], Fine Trees [28], K-Nearest Neighbor [29], and Naïve Bayes [30]. There are two possibilities for an element present in the images, text and non-text. The following parameters for classification are used in the paper:
a) True Positives (TP): Text is discovered as text.
b) True Negative (TN): Non-text is discovered as non-text.
c) False Positive (FP): Non-text is discovered as text.
d) False Negative (FN): Text is discovered as non-text.
Therefore, the overall accuracy (A) of the classifiers is computed as

A = (TP + TN) / (TP + TN + FP + FN)

The accuracy calculated by this equation is used as the final parameter for the overall accuracy of the classifiers.
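In code, the confusion-matrix counts and the overall accuracy follow directly from these definitions (labels: 1 = text, 0 = non-text; the example vectors below are illustrative):

```python
def confusion_counts(y_true, y_pred):
    # TP: text found as text, TN: non-text as non-text,
    # FP: non-text as text, FN: text as non-text
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def overall_accuracy(tp, tn, fp, fn):
    # A = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, ground truth [1, 1, 0, 0, 1] against predictions [1, 0, 0, 1, 1] yields TP = 2, TN = 1, FP = 1, FN = 1, and an overall accuracy of 0.6.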
4 Experiments and Results
The experimental setup and the results obtained are discussed in this section. The three features are calculated on both the text (4786) and non-text (4549) elements and combined to make a feature vector (FV). There are two classes, text (1) and non-text (0), so the class or response vector (R) consists of the two values 1 and 0. Thus, each element contributes a feature vector FV = [MSWR, CV, Solidity], and the corresponding entry of the response vector is R ∈ {1, 0}.
For building the classification model, we use the MATLAB Classification Learner application. This application, which is part of MATLAB, trains models to classify data and offers many classifiers based on supervised machine learning algorithms. Data can be explored, trained, validated, and assessed using this application, which is easy to use and gives accurate results. The detailed experimental set-up is displayed in Table 1.
10-fold cross-validation is used in the experiments to obtain reliable accuracy estimates. The feature vector is passed as input to the four classifiers mentioned in Sect. 3.3, and the accuracy for the different classifiers is obtained.
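The 10-fold split can be sketched as follows: shuffle the sample indices once, deal them round-robin into ten folds, and hold one fold out per round. The total of 9,335 elements is the 4,786 text plus 4,549 non-text elements above; the seed is an arbitrary assumption.

```python
import random

def kfold_splits(n, k=10, seed=42):
    # shuffle the sample indices once, deal them round-robin into k folds,
    # then use each fold in turn as the held-out validation set
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = kfold_splits(4786 + 4549)  # 9,335 text and non-text elements
```

Each element appears in exactly one validation fold, so every sample contributes to both training and validation over the ten rounds.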
The results obtained are displayed in Table 1. It is evident from the table that the highest accuracy is obtained for the Bagged Trees classifier. Bagging is driven entirely by resampling the data, which reduces the risk of over-fitting, and it also performs well on high-dimensional data. Moreover, missing values in the dataset do not affect the performance of the algorithm. The bagged tree combines many weak learners so that the ensemble outperforms a single strong learner.
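The bootstrap-aggregation idea behind the Bagged Trees classifier can be sketched in a few lines. The version below uses one-level decision trees (stumps) as the weak learners on a hypothetical one-feature dataset; it illustrates bagging, and is not a reimplementation of MATLAB's Bagged Trees.

```python
import random

def train_stump(X, y):
    # exhaustive search for the single feature/threshold/polarity split
    # that best separates the two classes on this (bootstrap) sample
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [1 if pol * (row[f] - t) >= 0 else 0 for row in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, pol)
    return best[1:]

def stump_predict(stump, row):
    f, t, pol = stump
    return 1 if pol * (row[f] - t) >= 0 else 0

def bagged_fit(X, y, n_learners=11, seed=0):
    # bagging: train each weak learner on a bootstrap resample of the data
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble

def bagged_predict(ensemble, row):
    # majority vote over the weak learners
    votes = sum(stump_predict(s, row) for s in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

# toy one-feature data: class 0 (non-text) clusters low, class 1 (text) high
X = [[0.10], [0.15], [0.20], [0.25], [0.75], [0.80], [0.85], [0.90]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = bagged_fit(X, y)
```

Because each learner sees a different resample, individual stumps may err, but the majority vote is much more stable, which is the property credited to Bagged Trees above.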
Therefore, due to the advantages mentioned above, the accuracy obtained from Bagged Trees is highest using the feature vector consisting of the three features. The confusion matrix, which consists of the TP, TN, FP, and FN counts, is used to construct the ROC curves for the classifiers, shown in Figs. 6, 7, 8 and 9. The ROC curve is also an indicative measure of the best classifier, based on the area under the curve (Table 2).
The area under the ROC curve is largest in the Bagged Trees case, indicating that Bagged Trees is the best classifier among the four chosen classifiers.
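The area under the ROC curve used to rank the classifiers can be computed without plotting: AUC equals the probability that a randomly chosen positive (text) sample receives a higher classifier score than a randomly chosen negative (non-text) sample, counting ties as half. The scores below are hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as half a win (equivalent to the Mann-Whitney U statistic)
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

text_scores = [0.9, 0.8, 0.6]     # hypothetical classifier scores for text samples
nontext_scores = [0.4, 0.3, 0.7]  # hypothetical scores for non-text samples
```

A perfect classifier yields an AUC of 1.0 and a random one 0.5, so a larger area under the curve directly reflects a better separation of text from non-text.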
The choice of the classifier is necessary for the classification of text and non-text elements and is an essential step in the classification process, because many classifiers exist in the domain of machine learning algorithms. Previous researchers have either made an arbitrary choice of classifier or followed the traditional approach of using SVM or AdaBoost classifiers. We contribute by performing the classifier selection with the help of the MATLAB Classification Learner application, which has not been well explored for the classification of text and non-text elements.
In comparison with other state-of-the-art methods, Iqbal et al. [10] considered 25 images of the ICDAR 2011 dataset for experiments, whereas we have chosen 229 images for choosing the classifier. The images are of very different types and thus help build a more accurate training model for handling different testing sets.
The method in [31] applies a CNN for classification and thus requires high computation time for evaluating the training model compared to the proposed method using traditional classifiers. Mukhopadhyay et al. [32] used 100 images with a one-class classifier and obtained 71% accuracy, whereas our method obtains 83% accuracy.
Methods using deep learning achieve higher accuracy, but their computation cost is high: an extensive training set [33] is required for the training process, and although these methods can detect different text patterns [34, 35] in images, the need for a GPU framework [36] increases the cost. So, we choose to work with traditional machine learning classifiers and achieve results with small training sets.
5 Conclusion
The present paper demonstrates the work done to build a classifier model for the text and non-text elements present in natural scene images. The classification of text and non-text elements is the preliminary step for detecting and extracting the text regions. The paper explores the possibility of using existing machine learning algorithms to build the classification models; the reason behind this approach is to use the simplicity of such models and perform experiments with less time and training data. The features used in the paper are mutually exclusive, so they contribute to identifying the text and non-text correctly. The ICDAR 2013 dataset is used as it provides proper ground truth for the experiments. Future work includes using the Weka tool, other relevant edge-smoothing filters, and deep learning tools for classification with new, innovative text-specific features.
References
Distante, A., Distante, C.: Handbook of Image Processing and Computer Vision: Volume 2: From Image to Pattern (2020)
Rainarli, E.: A decade: review of scene text detection methods. Comput. Sci. Rev. 42, 100434 (2021)
Shivakumara, P., Alaei, A., Pal, U.: Mining text from natural scene and video images: a survey. Wiley Interdiscipl. Rev. Data Min. Knowl. Discov. e1428 (2021)
Lucas, S.M., et al.: ICDAR 2003 robust reading competitions: entries, results, and future directions. Int. J. Doc. Anal. Recogn. (IJDAR) 7(2–3), 105–122 (2005)
Shahab, A., Shafait, F., Dengel, A.: ICDAR 2011 robust reading competition challenge 2: reading text in scene images. In: 2011 International Conference on Document Analysis and Recognition, pp. 1491–1496. IEEE (2011)
Karatzas, D., et al.: ICDAR 2013 robust reading competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1484–1493. IEEE (2013)
Raisi, Z., Naiel, M.A., Fieguth, P., Wardell, S., Zelek, J.: Text detection and recognition in the wild: a review. arXiv preprint arXiv:2006.04305 (2020)
Sullivan, E.: Understanding from machine learning models. Br. J. Philos. Sci. (2020)
Shiravale, S.S., Sannakki, S.S., Rajpurohit, V.S.: Recent advancements in text detection methods from natural scene images. Int. J. Eng. Res. Technol. 13(6), 1344–1352 (2020)
Iqbal, K., Yin, X.-C., Yin, X., Ali, H., Hao, H.-W.: Classifier comparison for MSER-based text classification in scene images. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2013)
Zhu, A., Wang, G., Dong, Y.: Detecting natural scenes text via auto image partition, two-stage grouping, and two-layer classification. Pattern Recogn. Lett. 67, 153–162 (2015)
Lee, J.-J., Lee, P.-H., Lee, S.-W., Yuille, A., Koch, C.: AdaBoost for text detection in natural scene. In: 2011 International Conference on Document Analysis and Recognition, pp. 429–434. IEEE (2011)
Chen, X., Yuille, A.L.: Detecting and reading text in natural scenes. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, vol. 2, p. II. IEEE (2004)
Pan, Y.-F., Liu, C.-L., Hou, X.: Fast scene text localization by learning-based filtering and verification. In: 2010 IEEE International Conference on Image Processing, pp. 2269–2272. IEEE (2010)
Ma, L., Wang, C., Xiao, B.: Text detection in natural images based on multi-scale edge detection and classification. In: 2010 3rd International Congress on Image and Signal Processing, vol. 4, pp. 1961–1965. IEEE (2010)
Pan, Y.-F., Hou, X., Liu, C.-L.: A hybrid approach to detect and localize texts in natural scene images. IEEE Trans. Image Process. 20(3), 800–813 (2010)
Maruyama, M., Yamaguchi, T.: Extraction of characters on signboards in natural scene images by stump classifiers. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 1365–1369. IEEE (2009)
Ansari, G.J., Shah, J.H., Yasmin, M., Sharif, M., Fernandes, S.L.: A novel machine learning approach for scene text extraction. Future Gener. Comput. Syst. 87, 328–340 (2018)
Wei, Y., Zhang, Z., Shen, W., Zeng, D., Fang, M., Zhou, S.: Text detection in scene images based on exhaustive segmentation. Sig. Process. Image Commun. 50, 1–8 (2017)
Long, S., He, X., Yao, C.: Scene text detection and recognition: the deep learning era. Int. J. Comput. Vis. 129(1), 161–184 (2021)
Soni, R., Kumar, B., Chand, S.: Extracting text regions from scene images using weighted median filter and MSER. In: 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), pp. 915–920. IEEE (2018)
Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 22(10), 761–767 (2004)
Soni, R., Kumar, B., Chand, S.: Optimal feature and classifier selection for text region classification in natural scene images using Weka tool. Multimedia Tools Appl. 78(22), 31757–31791 (2019). https://doi.org/10.1007/s11042-019-07998-z
Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2963–2970. IEEE (2010)
Majtey, A.P., Lamberti, P.W., Prato, D.P.: Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states. Phys. Rev. A 72(5), 052310 (2005)
The Math Works, Inc.: MATLAB, Version 2020a. Natick, MA: The Math Works, Inc. (2020). https://www.mathworks.com/. Accessed 28 May 2020
Mousavi, R., Eftekhari, M.: A new ensemble learning methodology based on hybridization of classifier ensemble selection approaches. Appl. Soft Comput. 37, 652–666 (2015)
Rokach, L., Maimon, O.Z.: Data Mining with Decision Trees: Theory and Applications, vol. 69. World Scientific (2007)
Fix, E., Hodges, J.L.: Discriminatory analysis. Nonparametric discrimination: consistency properties. Int. Stat. Rev./Revue Internationale de Statistique 57(3), 238–247 (1989)
Zhang, Y., Jatowt, A.: Estimating a one-class naive Bayes text classifier. Intell. Data Anal. 24(3), 567–579 (2020)
Wu, H., Zou, B., Zhao, Y.-Q., Guo, J.: Scene text detection using adaptive color reduction, adjacent character model and hybrid verification strategy. Vis. Comput. 33(1), 113–126 (2015). https://doi.org/10.1007/s00371-015-1156-1
Mukhopadhyay, A., et al.: Multi-lingual scene text detection using one-class classifier. Int. J. Comput. Vis. Image Process. (IJCVIP) 9(2), 48–65 (2019)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Deep structured output learning for unconstrained text recognition. arXiv preprint arXiv:1412.5903 (2014)
He, T., Huang, W., Qiao, Y., Yao, J.: Text-attentional convolutional neural network for scene text detection. IEEE Trans. Image Process. 25(6), 2529–2541 (2016)
Ou, W., Zhu, J., Liu, C.: Text location in natural scene. J. Chin. Inf. Process. 5(006) (2004)
Busta, M., Neumann, L., Matas, J.: Deep textspotter: an end-to-end trainable scene text localization and recognition framework. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2204–2212 (2017)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Copyright information
© 2022 The Author(s)
Soni, R., Sharma, D. (2022). Building Machine Learning Models for Classification of Text and Non-text Elements in Natural Scene Images. In: Qian, Z., Jabbar, M., Li, X. (eds) Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications. WCNA 2021. Lecture Notes in Electrical Engineering. Springer, Singapore. https://doi.org/10.1007/978-981-19-2456-9_97
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2455-2
Online ISBN: 978-981-19-2456-9