1 Introduction

Understanding actions in images and videos is of growing interest to researchers in artificial intelligence and image understanding. It plays a significant role in real-world applications such as activity monitoring, human-computer interaction, and video indexing [1,2,3]. The problem remains an ongoing challenge due to factors such as uneven illumination, partial occlusion, and complex backgrounds [4]. Moreover, since temporal information is involved in action detection, video-based methods are computationally expensive [5]. This has motivated many researchers to develop methods that cope with the challenges of action detection and recognition in still images [6, 7]. Although these methods are more efficient, they are not as accurate as video-based methods. Hence, there is a wide gap between still-image and video-based methods in understanding the content of such images. This issue motivated us to consider text detection and recognition to enhance the performance of action image detection and recognition. The advantage of text detection and recognition is that the process does not require temporal information, unlike existing methods that explore temporal frames for action detection [8].

Several methods for detecting and recognizing text in natural scene images can be found in the literature [9,10,11,12]. These address many challenges, such as arbitrarily oriented text, multi-script text, and arbitrarily shaped text. However, since most of these methods were developed for natural scene images without much action in them, they may not perform well for our applications of interest, because the presence of actions degrades the text information. As a result, text in action images loses quality, contrast and sharpness, which hurts the performance of existing methods. As seen from the sample images shown in Fig. 1, which illustrate the five classes of interest (Concert, Cooking, Craft, Teleshopping and Yoga), text can be adversely affected by action in an image, as well as by other confounding factors including poor quality, low contrast, and uneven illumination. This motivates classifying video images by action, so that a method tuned to the complexity of each specific problem can be chosen to improve text detection and recognition performance. For example, for the Cooking and Craft images in Fig. 1 the background is an indoor scene, while for Concert, Teleshopping and Yoga the background is an outdoor scene. Likewise, for the Cooking, Craft and Yoga action classes, caption text (which is edited text) is a prominent feature, while for Concert and Teleshopping, scene text (which is part of the image) is a prominent feature. Based on these observations, we propose a new method for classifying action images drawn from these five classes.

Fig. 1
figure 1

Example of five action image classes (Original Source: [42])

Inspired by the tremendous discriminative ability of deep CNNs [13,14,15,16,17] in complex tasks, we explore their use for the classification of action images. The proposed method exploits the learning ability of deep neural networks at the pixel and component levels to develop a deep hybrid classification framework. The contributions of this work are as follows:

  1. (i)

    We demonstrate a new classification strategy that adapts existing methods to a novel set of sub-tasks and combines them into an ensemble for solving this complex problem.

  2. (ii)

    The use of features extracted at the stable-component pixel level and from facial regions to define a unique relationship between foreground and background information in the input image is new.

  3. (iii)

    The manner in which we combine the individual sub-task-specific features obtained from deep neural networks is new for the task of action image classification.

  4. (iv)

    Lastly, our exploration of multi-modality, combining text, face, and general pixel-distribution information through a set of deep learning models, is also new.

The rest of the paper is structured as follows. A critical analysis of existing methods for text detection, recognition and scene image classification is presented in Sect. 2. The proposed hybrid deep learning model for the classification of action images is presented in Sect. 3. Section 4 provides a variety of experiments to validate the proposed method, including an ablation study, experiments on the classification of action images, and experiments on text detection and recognition. Section 5 summarizes the proposed work and describes our future work.

2 Related work

Since the objective of our work is to classify action images for enhancing the performance of text detection and recognition, we provide here a critical analysis of existing methods for text detection, text recognition, and scene image classification.

In [9], the authors used a deep neural network for detecting text in natural scene images. The method addresses the challenges of arbitrarily oriented text and complex backgrounds. However, it is limited to natural scene images rather than video containing different actions. In [10], a method was presented for detecting scene text using deep reinforcement learning, wherein an agent, given a state, learns to estimate future returns. Further, the method makes sequential decisions to find scene text. The method in [12] reads scene text in the wild based on scene text proposals. It explores a score function that uses histograms of oriented gradients and ranks the proposals according to their probability of being text. The method in [11] leverages color-prior-guided MSER for natural scene text detection. It extracts stroke features using stroke width distance, which is based on segmented edges. From the above discussion, it is noted that the scope of these methods is confined to text detection in natural scene images. Therefore, they may not be effective for video images, because the latter usually contain multiple types of text, namely graphics text (which is superimposed) and scene text (which is natural text, as in a scene image). To overcome this problem, the method in [18] proposed Fourier-Laplacian filtering and a Hidden Markov Model for text and non-text classification. It uses hand-crafted features to eliminate false positives and improve performance. In summary, the above text detection methods focus on challenges such as low resolution, multiple orientations, multiple scripts, and complex backgrounds in natural scene images or video frames, but not on action images, where one can expect unpredictable background complexity (the nature of such texts is shown in Fig. 1). For detecting text in natural scene images, the authors in [19] propose a two-stage approach using a quadrilateral scene text detector. However, the structural assumption behind such a detector may not hold for text that is irregular or arbitrarily oriented. The method presented in [20] employs an adaptive Bézier-curve-based network for spotting text in scene images. The technique attempts to improve text detection performance by fitting an accurate bounding box to text lines that are arbitrarily oriented. In [21], the method uses similarity estimation between the text components of multiple views of natural scene images to obtain enhanced performance. However, the requirement of multiple views as input is a constraint. Zhu and Du [22] proposed a segmentation-based scene text detection method. It introduces the TextMountain architecture, which models the relationship between the center and border of a text region to overcome the challenges of text detection.

Similarly, several text recognition methods for natural scene and video images can be found. The method discussed in [23] used an edge descriptor based on local binary patterns. However, it is not robust to complex background images. The method in [24] explores a CNN for multi-lingual text recognition in natural scene images; its scope is limited to natural scene images. As a result, the above methods may not perform well for text in video images, which contain multiple types of text, namely caption and scene text, unlike natural scene images, which contain only scene text. To expand the ability of text recognition methods, the method in [25] proposes fractals, wavelet transforms and optical flow for tackling the challenges of video and natural scene images. Although the above methods address the issues of natural scene and video images, it is unclear how they behave on action images. The method in [26] extracts the dependencies between word tokens in a sentence, which helps to extract 2D spatial dependencies between two characters in a scene text image. The method in [27] uses character anchor pooling for scene text recognition; with this step, it gathers more vital information for recognizing text in the images. The method in [28] introduces a character awareness network for scene text recognition. The model involves a 2D character attention module, which enhances foreground text instances based on character awareness. Lin et al. [29] proposed a sequential transformation attention-based network for text recognition in scene images. The network rectifies irregular text by dividing the task into a series of patch-wise basic transformations. Further, the model uses neighbor information to preserve the shape of characters. However, the performance of the method degrades for curved and arbitrarily shaped text, which are common in action images.

There are methods that use text recognition for different applications. For example, [30] proposed a method for defect extraction in sewers from CCTV inspection videos using text recognition. The method detects and recognizes text in video images captured by CCTV cameras for identifying the location and severity of cracks. The paper [31] proposed a flash flood categorization system using scene text recognition. The approach detects and recognizes text in bridge images and works well in different conditions such as fog, sunny afternoons and dusk. The method [32] uses text detection and recognition for person re-identification, relying on deep learning with a CNN and LSTM. These discussions show that text detection and recognition have been explored for various situations, including natural scene images, video images, etc. However, the purposes of these methods are specific, and they work only in particular situations. None of them addresses action images.

If we consider the posed problem as general scene categorization, we can find several methods, like the ones above, for classifying different scenes and images. Bosch [33] used a different color space and leveraged SIFT features and probabilistic latent semantic analysis, resulting in a hybrid generative/discriminative approach. Dunlop [34] proposed a method based on the semantic segmentation of images and videos of indoor and outdoor scenes for classification. A public API made available by Google, known as the Google Vision API [35], can be used for annotating scene images; the underlying system uses deep-learning-based features for labeling query images. The method [36] uses a combination of CNN and LSTM to classify scene images of different categories based on multiple views and levels; it considers each image from multiple views to deal with image variations. The above methods work well for general scene images, where multiple objects with clear shapes are visible, but this may not be the case for the action images considered in this work. In addition, the objective of the above methods is to classify scene images, not to enhance the performance of text detection and recognition. The method [37] explores color spaces, gradient distributions and Gabor wavelet binary patterns for classifying water images of different types. Xie et al. [38] proposed a knowledge distillation strategy based on EfficientNet for ImageNet classification. The model achieves high accuracies by introducing noise into the training procedure and making the student network equal to or even more powerful (relative to its size) than the teacher. Dosovitskiy et al. [39] proposed a model that overcomes some of the drawbacks of conventional convolutional neural networks by entirely replacing convolutions with attention blocks, i.e., by applying a pure transformer directly to the sequence of patches of an input image for classification. These methods are applicable to general images but ignore text information.

Several methods have been proposed in the literature for improving text detection and recognition performance. For example, the method [40] uses a combination of local and global features for video image categorization. However, its success depends on the success of text detection. In the same way, the method [41] employs a combination of face detection, torso detection and text detection for enhancing the performance of bib number detection and recognition. It is developed specifically for marathon images, which depict a very specific scene setting. The method [8] explores a combination of rough and fuzzy sets for classifying scene images based on text and background information. It extracts features for each classified edge component for the classification of images. However, the method is not tested on action images without text information. Recently, the method [42] proposed a combination of the Discrete Cosine Transform and the Fast Fourier Transform for classifying caption and scene texts in action images to improve text recognition results. The method generates a fused image from the input and then computes the average of the sparsity and non-sparsity counts, in terms of zero and non-zero pixel values, for classification. However, it is limited to text line image classification and does not classify action images.

It is noted from the above review of text detection, recognition, scene image classification, and classification methods for enhancing text detection and recognition performance that none of the existing methods aims at classifying action images to improve text detection and recognition performance. In addition, there are methods that focus on particular image types or situations for classification but without targeting text detection and recognition performance. As a result, one can conclude that there is a need for a method that works for action images. Thus, we propose a new hybrid deep learning model for classifying action-oriented images to improve the performance of text detection and recognition.

3 The proposed methodology

Since the classification of action images is a complex problem, our idea is to extract information at different levels, such as pixels, components, and facial information if available, to ease the problem. It is observed that the relationship between foreground and background information can be represented in a unique way. For example, in Yoga, Concert and Teleshopping images, outdoor scenes can be expected to represent the background while actions represent the foreground. Similarly, for Cooking and Craft images, an indoor scene represents the background and actions represent the foreground. To capture these observations, the proposed work detects regions of interest, which include text information as well as dominant background regions and facial information, because these features represent the foreground, while the pixel distribution and values represent both the foreground and background of the images. This is the main intuition behind combining the regions of interest (text components) given by Maximally Stable Extremal Regions (MSER) [43], the face information given by the Multitask Cascaded Convolutional Neural Network (MTCNN) [44], and the pixel information learned by the ResNet50 model. Overall, the way the proposed method combines features and deep learning models is able to tackle the challenges of action image classification.

The use of the Residual Network as a general classifier was motivated by [17], where it is mentioned that deep residual learning converges much faster than standard solvers that are unaware of the residual nature of the solutions. In addition, ResNet50, being a very deep neural network, has a high entropic capacity compared to relatively shallower networks and can therefore better cope with large intra-class variations and random noisy pixels in the images. These factors motivated us to explore ResNet50 for learning parameters at the pixel level.

This step outputs a vector of five elements indicating the probabilities of membership of the input image in the respective classes, which is denoted as P1. According to our experiments, the P1 architecture achieves a classification rate of more than 99% for the Yoga action class. As a result, the proposed method considers the other four classes, namely Concert, Cooking, Craft and Teleshopping, for training at the component level given by MSER. The distribution of stable extremal regions tends to be relatively similar across images of a particular class but differs considerably between images of different classes. Due to its computational simplicity and robustness, the proposed work uses Maximally Stable Extremal Region (MSER) detection. The outputs of MSER detection are fed to an MSER-based classifier by training the VGG16 net [16]. It is noted that MSER usually extracts components of uniform color [43]. In general, characters and objects in images are formed by uniform color, and both play a prominent role in representing the content of images. To capture these observations, we train VGG16 at the component level. This step outputs a vector of four elements, indicating the probabilities of membership of the input image in the respective classes, denoted as P2. In the same way, to exploit face information in action images, we use the Multitask Cascaded Convolutional Network [44], which employs deep learning for face detection. The output of face detection is fed to VGG16 [13] for training at the face-component level. Since the Craft and Cooking classes do not provide faces, and since P1 gives a classification rate of more than 99% for the Yoga class, we train VGG16 with face components of the other two classes, namely Concert and Teleshopping, which gives a vector of two elements denoted as P3.

The three obtained vectors, P1, P2 and P3, are supplied to a Fully Connected Neural Network (FCNN) that learns an effective combination of the probability vectors for classifying the original images. The FCNN has one hidden layer of 16 units with ReLU activations and one output layer of 5 units with Sigmoid activations that denote the final class confidences. The number of output units in the final layer can be varied based on the number of classes in the dataset. The appropriate architecture for the proposed classification was selected according to preliminary experimental results. Increasing the number of hidden layers or the number of units in the hidden layer led to overfitting that even regularization or dropout could not compensate for. The good performance achieved by such a simple architecture stems from the fact that the probabilities given by the deep neural networks mostly agree with each other on the class of the input image; hence, it is relatively easy for the final network to learn a combination of those probabilities. The major role of the model combination step is to determine the weights to be assigned to the outputs of the classifiers, given the input probability distributions from the three classifiers and the ground truth labels of the images. The framework of our approach is shown in Fig. 2.

Fig. 2
figure 2

The framework of our model. (P1 indicates the feature vector extracted from the pixel information, P2 indicates the features extracted from MSER regions, and P3 indicates the features extracted from the face regions only)

The proposed method explores the ResNet50 and VGG16 models for learning parameters at both the pixel and component levels to extract distinct features for action images of different classes, which results in feature vectors containing probabilities. The rest of this section consists of four sub-sections. First, Subsect. 3.1 presents ResNet50 for studying the overall pixel distribution to predict classes. VGG16 is then explored in Subsect. 3.2 for learning at the component level given by MSER to obtain probability scores for the classes. Since many classes considered in this work contain face information, VGG16 is also explored for learning facial information at the face level given by MTCNN, as discussed in Subsect. 3.3. Finally, the unification of the three deep neural networks for action image classification is presented in Subsect. 3.4.

3.1 ResNet50 for overall pixel-distribution based class prediction

We take the following approach for training deep net classifiers at different levels. For every class, 90% of the samples are used for training and 10% for testing. Inspired by [45], where transfer learning was introduced for deep neural networks, we adopt the same strategy for classifying action images. In this work, we use neural networks that are pre-trained on ImageNet [46] and fine-tuned [47] on the experimental datasets used in this work. The process involves removing the top fully connected layers and introducing a new, randomly initialized fully connected layer of 1024 units with ReLU activations. The weights of all the previous residual layers are fixed. The output layer is a new fully connected layer with 5 outputs having sigmoid activations. Having trained these two newly added fully connected layers, the weights of the top-most (last) residual block were made trainable and subsequently fine-tuned (trained with a very low learning rate) together with the two final fully connected layers. The reason behind this approach is that the convolutional blocks towards the beginning learn to detect very general structures like edges, curves and shapes, while the layers towards the end detect more class-specific features. Thus, keeping the earlier layers fixed, the layers towards the end were trained so that the network learns the parameters corresponding to features relevant to the classes in our classification problem. The training was done in two phases to prevent large and noisy gradient updates from severely disturbing the previously learned weights of the model. More details about the architecture of ResNet50 can be found in [17].
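This two-phase transfer-learning procedure can be illustrated with a short Keras sketch. The layer sizes (a 1024-unit ReLU layer and a 5-unit sigmoid output) and the learning rates and optimizers follow Sects. 3.1 and 3.3; the input size, the data pipeline and the exact set of unfrozen layers are our assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the two-phase transfer-learning setup (assumptions noted above).
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 5

# ImageNet-pretrained ResNet50 without its top fully connected layers.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # Phase 1: all residual layers are frozen.

model = models.Sequential([
    base,
    layers.Dense(1024, activation="relu"),            # new FC layer
    layers.Dense(NUM_CLASSES, activation="sigmoid"),  # new output layer
])

# Phase 1: train only the two new fully connected layers (Adam, lr = 0.001).
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=50, validation_data=val_ds)

# Phase 2: unfreeze the top-most residual block (the conv5 stage in Keras
# naming, an assumption) and fine-tune with a very low learning rate
# (SGD with momentum 0.9, lr = 0.0001), as described in Sect. 3.3.
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith("conv5_")
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=50, validation_data=val_ds)
```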

The loss function used in all the classifiers presented in this work is categorical cross-entropy as defined in Eqs. (1) and (2).

The categorical cross-entropy loss for a single training example is given by:

$$L_{i} = - \sum_{c = 1}^{M} y_{c} \log\left( p_{c} \right)$$
(1)

and for the overall training set is given by:

$$L = - \frac{1}{N}\sum_{i = 1}^{N} \sum_{c = 1}^{M} y_{c}^{(i)} \log\left( p_{c}^{(i)} \right)$$
(2)

where N is the number of samples in the training set, M is the number of classes, $y_{c}^{(i)}$ is the binary indicator (0 or 1) of whether class label c is the correct classification for observation i, and $p_{c}^{(i)}$ is the predicted probability that observation i belongs to class c.
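As a small worked illustration of Eqs. (1) and (2), the following NumPy snippet computes the per-example and averaged categorical cross-entropy for a toy batch; the label and probability values are hypothetical.

```python
# Toy illustration of Eqs. (1) and (2); values are made up for demonstration.
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true: one-hot labels (N, M); y_pred: predicted probabilities (N, M)."""
    per_example = -np.sum(y_true * np.log(y_pred + eps), axis=1)  # Eq. (1)
    return per_example, per_example.mean()                        # Eq. (2)

# Two samples, five classes (Concert, Cooking, Craft, Teleshopping, Yoga).
y_true = np.array([[0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 1]])
y_pred = np.array([[0.05, 0.80, 0.05, 0.05, 0.05],
                   [0.10, 0.10, 0.10, 0.10, 0.60]])
L_i, L = cross_entropy(y_true, y_pred)
print(L_i)  # per-example losses
print(L)    # average loss over the training set
```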

3.2 MSER-based classifier for learning dominant stable-region based features

It is noted that Maximally Stable Extremal Region (MSER) detection works well for grouping pixels that share similar values or properties, resulting in connected components [43]. In other words, MSER detection aims at extracting regions that stay nearly the same through a wide range of thresholds. Sample MSER results for input images are shown in Fig. 3, where one can see that the components are formed based on the similarity of intensity values. The training set for the MSER-based classifier is prepared by running the MSER detection algorithm on all the images of the four class datasets. After MSERs are detected, all the pixels that do not belong to an MSER are masked out; thus, the training set consists of images containing only their MSERs. A transfer-learning procedure, similar to the one adopted for training ResNet50, is used for fine-tuning and training VGG16 (pre-trained on ImageNet) at the component level by feeding the output of the MSER extraction step to VGG16 and optimizing the categorical cross-entropy loss. This results in a vector of four probabilities representing the confidence of the MSER-based classifier regarding the membership of the image in each of the four classes.
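The preparation of the MSER-masked training images described above can be sketched with OpenCV as follows; the default MSER parameters and the file names are assumptions, not the authors' exact settings.

```python
# Rough OpenCV sketch: detect MSERs and mask out all non-MSER pixels.
import cv2
import numpy as np

def mask_to_msers(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()                 # default MSER parameters (assumed)
    regions, _ = mser.detectRegions(gray)    # lists of (x, y) pixel coordinates
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for pts in regions:
        mask[pts[:, 1], pts[:, 0]] = 255     # mark MSER pixels (row = y, col = x)
    # Keep only the stable extremal regions; everything else is masked out.
    return cv2.bitwise_and(image_bgr, image_bgr, mask=mask)

masked = mask_to_msers(cv2.imread("concert_sample.jpg"))  # hypothetical file name
cv2.imwrite("concert_sample_mser.png", masked)
```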

Fig. 3
figure 3

The result of MSER extraction

3.3 Face-based classifier for learning facial image features

It has been observed that facial features (orientation, expression, degree of occlusion, etc.) in action images vary much less within a class than between images from different classes. Face detection is carried out on the images of the Concert and Teleshopping classes. This is because, in general, Cooking and Craft images may not provide face information, and for the Yoga class, P1 already achieves the best results. Therefore, the following steps are used for detecting faces in images of the two classes mentioned above. Face detection uses the Multitask Cascaded Convolutional Network (MTCNN) as in [44]. Candidate facial windows and their bounding box regression vectors are obtained using the fully convolutional Proposal Network (P-Net). The candidates are passed to the next CNN, called the Refine Network (R-Net), which conducts bounding box regression and Non-Maximum Suppression (NMS) to filter out false candidates. Sample face detection results for Concert and Teleshopping images are shown in Fig. 4. The reason for choosing MTCNN for face detection is that it proved robust for the proposed work in our preliminary experiments. The faces extracted from images of the respective classes are used as the training set for the face classifier, for which we use VGG16 as discussed above for the MSER-based classifier. This results in a vector of two probabilities representing the confidence of the face-based classifier regarding the membership of the image in the two respective classes.
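As a rough illustration of this face-extraction step, the following sketch uses the open-source mtcnn Python package (an assumed stand-in for the MTCNN of [44]); the confidence threshold and file name are illustrative only.

```python
# Sketch of extracting face crops for the face-based VGG16 classifier
# (mtcnn package, threshold and file names are assumptions).
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def extract_faces(image_bgr, min_confidence=0.9):
    """Return cropped face regions detected by the MTCNN cascade."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB input
    faces = []
    for det in detector.detect_faces(rgb):
        if det["confidence"] < min_confidence:
            continue
        x, y, w, h = det["box"]
        x, y = max(0, x), max(0, y)
        faces.append(image_bgr[y:y + h, x:x + w])
    return faces

crops = extract_faces(cv2.imread("teleshopping_sample.jpg"))  # hypothetical file
# Each crop would then be resized (e.g. to 224x224) and fed to VGG16 for training.
```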

Fig. 4
figure 4

The results of face detection method for Concert and Teleshopping images shown in Fig. 1

In this work, both ResNet50 and the face-based classifier were trained and fine-tuned for 50 epochs, with learning rates of 0.001 during training and 0.0001 during fine-tuning. The Adam optimizer was used to update the weights during training, and the Stochastic Gradient Descent optimizer with a momentum of 0.9 was used for fine-tuning the networks.

3.4 Unification of the 3 deep convolutional neural networks

The proposed method explores ResNet50 for training at the overall pixel-distribution level and VGG16 at the component levels, resulting in three deep neural networks. The outputs obtained from the three networks are concatenated into an 11-dimensional vector of probabilities (5 from pixels, 4 from MSER components and 2 from face components). Next, this 11-dimensional vector is supplied to an FCNN that learns a mapping between the outputs of the three classifiers and the actual class of the image. The final fully connected neural network comprises one hidden layer of 16 units with Leaky ReLU activation and an output layer of 5 units with Sigmoid activation, indicating the probabilities of membership of the input image in the 5 respective classes. The proposed model uses the categorical cross-entropy loss as the objective function for training the network. The final architecture of the proposed work is illustrated in Fig. 5.
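A minimal Keras sketch of this combination network is given below; the layer sizes follow the description above, while the optimizer choice and training details are assumptions.

```python
# Sketch of the combination FCNN: 5 + 4 + 2 = 11 input probabilities -> 5 classes.
import tensorflow as tf
from tensorflow.keras import layers, Model

p1 = layers.Input(shape=(5,), name="p1_pixel")   # ResNet50 output (P1)
p2 = layers.Input(shape=(4,), name="p2_mser")    # VGG16 on MSER components (P2)
p3 = layers.Input(shape=(2,), name="p3_face")    # VGG16 on face components (P3)

x = layers.Concatenate()([p1, p2, p3])           # 11-dimensional probability vector
x = layers.Dense(16)(x)
x = layers.LeakyReLU()(x)                        # hidden layer of 16 units
out = layers.Dense(5, activation="sigmoid")(x)   # final class confidences

fusion = Model([p1, p2, p3], out)
fusion.compile(optimizer="adam",                 # optimizer choice is an assumption
               loss="categorical_crossentropy", metrics=["accuracy"])
# fusion.fit([P1_train, P2_train, P3_train], y_train, epochs=...)
```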

Fig. 5
figure 5

The final combination phase of the proposed Hybrid Deep Net for action image classification. (P1, P2 and P3 are discussed in Fig. 2)

4 Experiments

Our experimental analysis is presented in four sub-sections. Descriptions of the datasets, the performance measures and the existing methods used for comparison are given in Sect. 4.1. To show the effectiveness of the key steps we use, Sect. 4.2 presents an ablation study. Section 4.3 provides experimental results for the proposed method and existing techniques on classification. To validate that the proposed classification is useful, Sect. 4.4 provides an experimental analysis of text detection and recognition.

4.1 Dataset, performance measure and work for comparison

Dataset To evaluate the proposed method for classifying action images, we created our own dataset, which includes five classes, namely Concert (Ct), Cooking (Ck), Craft (Cr), Teleshopping (Te) and Yoga (Yg). We used different internet sources, such as YouTube and other social media, for dataset collection. Our dataset includes 5078 images across the five classes. For training and testing of each class, the proposed method uses 90% and 10% of the samples, respectively; the same criterion is followed for all the classification experiments in this work. To test the scalability and effectiveness of the proposed method on classification, we consider the dataset used for video image type categorization in [8], which contains 10 classes, namely Defense (D), Economics (Ec), Sports (S), Medical (M), Weather (W), Animation (A), e-Learning (e-L), Technology (T), Outlet (O) and Animal Planet (AP). Since each class comprises 3,000 frames, the 10 classes provide 30,000 frames for validating the proposed method. Note that this dataset includes natural scene images with multi-type video text, while our dataset includes scene images with different actions and multi-type text. In addition, both datasets pose challenges such as poor resolution, low contrast, varying fonts and font sizes, background variations, and arbitrarily oriented text, due to their different nature and characteristics. In summary, our dataset and the standard dataset together provide 35,078 frames for conducting experiments.

To test scalability and generality, we also consider the benchmark dataset called Stanford40 Actions [47], which contains 40 classes of general actions, with 180 to 300 images per class, giving a total of 9,532 images for the 40 classes. The focus of this dataset is to capture general actions of persons. Another standard dataset, called the Scene Text Dataset (STD), provides 10 classes of scene images containing text information [8]. In contrast, our dataset captures person actions with multiple types of text information. In addition, our dataset may not include images of person faces, unlike the Stanford40 Actions dataset. The experiments involve a total of 44,610 images from the three datasets. We believe that collecting datasets with different focuses and natures ensures a fair evaluation of the proposed approach for classifying action images.

Performance Measure To evaluate the performance of the proposed approach, we use the Average Classification Rate (ACR) of the confusion matrix as the measure for evaluating the classification methods. We use the standard measures of Recall, Precision and F-score for the text detection experiments, and the Recognition Rate (RR) for text recognition, to validate the usefulness of the proposed method. The recognition rate is estimated using edit distances based on insertion, substitution and deletion operations in all the experiments. For the text detection and recognition experiments, we estimate the measures before and after classification to show the advantage of the proposed classification. It is expected that the text detection and recognition methods report better results after classification than before classification, although this is not necessarily true for all classes due to the limitations of the text detection and recognition methods. Note that for the text detection and recognition experiments, we consider only our dataset and STD, because the Stanford40 Actions dataset does not provide text information.
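For concreteness, the sketch below shows one plausible way to compute these two measures: ACR as the mean of the per-class diagonal of the (row-normalized) confusion matrix, and a recognition rate based on a normalized Levenshtein edit distance. The exact normalization used here is not specified in the text, so this is an assumed variant, not the authors' implementation.

```python
# Assumed implementations of ACR and an edit-distance-based recognition rate.
import numpy as np

def average_classification_rate(confusion):
    """Mean of the confusion-matrix diagonal after normalizing each row (class)."""
    conf = np.asarray(confusion, dtype=float)
    per_class = np.diag(conf) / conf.sum(axis=1)
    return per_class.mean()

def edit_distance(a, b):
    """Levenshtein distance with insertion, deletion and substitution costs of 1."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution
            d[i, j] = min(d[i - 1, j] + 1,            # deletion
                          d[i, j - 1] + 1,            # insertion
                          d[i - 1, j - 1] + cost)
    return d[len(a), len(b)]

def recognition_rate(gt, pred):
    """1 - normalized edit distance, averaged over all ground-truth words (assumed)."""
    scores = [1 - edit_distance(g, p) / max(len(g), len(p), 1)
              for g, p in zip(gt, pred)]
    return float(np.mean(scores))
```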

Methods for Comparison The following state-of-the-art methods are used for the comparative study in this work. Roy et al. [8] propose a method for classifying scene images containing text information to improve text detection and recognition performance. Qin et al.'s method [40] uses statistical, structural and spatial features and color spaces with an SVM for the classification of video text frames. The method in [38] uses a self-supervised approach (Noisy Student) for scene image classification, and the transformer-based method in [39] (Vision Transformer) is also used for scene image classification. Both methods exploit the deep-learning literature to achieve results for scene image classification.

Apart from the above, we also evaluate the performance of the Google Vision API on our task, which is a standard deep-learning-based public API for performing various tasks on scene images [35]. Given an input image, the API returns confidence scores for a set of pre-defined classes. For example, if an input image from the 'Animal Planet' class is given to the API, it may return high confidence scores for classes like trees, nature, animals, birds, agriculture, etc. For an input image belonging to the 'Animation' class, labels like cartoon, amusement park, amusement ride, etc., may be returned. The returned labels are native to the Google Vision API system. Using these confidence scores, a feature vector is created for each training sample. A cut-off threshold on the confidence score is set at 85%, determined empirically from the training set: decreasing this threshold incorporates irrelevant labels, while increasing it discards useful labels, which leads to an eventual loss of information in the representation. The above-mentioned methods represent the state of the art for comparison purposes.
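As an illustration of this feature-construction step (not the Google Vision API client itself), the following sketch converts returned label-confidence pairs into a fixed-length vector using the 85% cut-off; the label vocabulary and example response are hypothetical.

```python
# Generic sketch of thresholding label confidences into a binary feature vector.
def build_feature_vector(label_scores, vocabulary, threshold=0.85):
    """label_scores: dict of label -> confidence returned for one image (hypothetical)."""
    return [1.0 if label_scores.get(label, 0.0) >= threshold else 0.0
            for label in vocabulary]

vocabulary = ["trees", "nature", "animals", "cartoon", "amusement park"]
response = {"trees": 0.93, "nature": 0.88, "cartoon": 0.10}   # hypothetical API output
print(build_feature_vector(response, vocabulary))  # [1.0, 1.0, 0.0, 0.0, 0.0]
```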

In addition, to demonstrate the impact of our approach on downstream processing stages, we present experimental results for detecting and recognizing text in an input image before and after our classification step. For this purpose, we implement the well-known EAST method, which was developed for accurate scene text detection [9]. It uses deep convolutional neural networks to tackle the challenges of arbitrarily oriented text detection in natural scene images. Zhu and Du [22] describe a method that employs the TextMountain network for scene text detection. For recognition, we implement E2E-MLT [24], which proposes a model for multi-language scene text recognition; the approach uses deep learning models for recognizing text in scene images irrespective of script. The method by Lin et al. [29] uses an attention-based network for recognizing text in natural scene images. The motivation for choosing these four methods for the text detection and recognition experiments is that they employ deep learning and are capable of handling complex text detection and recognition situations.

4.2 Ablation study

The proposed method comprises three steps, namely training the ResNet50 deep learning model at the pixel level for classifying action images, and training the VGG16 model on MSER and face components. We carried out experiments calculating the classification rate (the mean of the confusion matrix diagonal) separately for each component of our framework, in order to understand its individual contribution. The Average Classification Rates (ACR) for ResNet50, VGG16 on MSER components and VGG16 on face components are 87.6%, 87.7% and 76.5%, respectively. ResNet50 and VGG16 on MSER components score almost the same, while VGG16 on face components reports a somewhat lower score. This is due to the absence of faces in the Craft and Cooking classes, for which the face detection method outputs either nothing or false results; when it outputs nothing, the proposed method adds zeros to the feature vectors for implementation purposes. On the other hand, ResNet50 and VGG16 on MSER components give good results compared to VGG16 on face components because those models obtain sufficient information from each image class. However, when we combine all three deep nets, the proposed method achieves 90.0% ACR. Therefore, we conclude that all three deep nets contribute to achieving the best results.

As discussed in the sections above, the proposed method combines three classifiers to achieve better results for action image classification. In order to assess the effectiveness of each classifier, we conducted experiments on each classifier and on the combined classifier for our dataset; the results are recorded in Table 1. It is noted from Table 1 that each classifier is effective and contributes to classifying action images of the respective classes. Hence, the combined model provides a higher classification rate than any of the individual components. In Table 1, '-' indicates that the classifier does not consider the class for training and testing.

Table 1 Confusion matrices of the proposed individual models and combined model on our dataset

4.3 Evaluating the proposed classification approach

Qualitative results of the proposed method on our dataset are shown in Fig. 6, where images with cluttered backgrounds and multi-type text are classified successfully. Quantitative results for the proposed and existing approaches on our dataset are compared in Table 2 by means of the confusion matrix and classification rate. It can be clearly seen from Table 2 that the proposed technique outperforms all the other methods we studied in terms of classification rate. The main reason for the poorer results of the existing methods, namely the Google Vision API [35], Qin et al. [40], Xie et al. [38] and Dosovitskiy et al. [39], is that they only work well when multiple objects in the images preserve their shapes. Roy et al.'s method [8] defines a limited set of shapes for edge components in images for classification; these limited shapes are the main cause of its poor results on the action dataset, whose shapes are unpredictable. It is noted from Table 2 that the method in [39] gives the second highest accuracy among the existing approaches. This is because the model replaces the inductive bias introduced by the convolution operation with the general idea of attention, which can directly model all possible relationships among the set of image patches. Since the model considers high-level semantic information for classification, it has better generalization ability than the other methods. However, it still falls short of the technique presented here. The proposed method combines the strengths of pixels, dominant components given by MSER, and face components to achieve superior classification performance.

Fig. 6
figure 6

Sample images successfully classified by the proposed model on our dataset. Original Source: [42]

Table 2 Confusion matrices of the proposed and existing approaches on the proposed dataset

The proposed and existing approaches are also tested on Roy et al.'s dataset (STD) [8], which provides 10 classes of different video types, and on the Stanford40 dataset [47], which provides 40 general action image classes, using the confusion matrix and classification rate. These two datasets are considered the standard ones. Qualitative results of the proposed technique on STD are shown in Fig. 7, where all the images are classified successfully. It is noted from Fig. 7 that the sample images of all the classes contain texts of different types. Quantitative results of the proposed and existing approaches on Roy et al.'s dataset are reported in Tables 3, 4, 5, 6, 7 and 8, from which it is observed that the proposed method achieves the best classification rate compared to the existing methods. When we compare the results on this dataset with those on the five classes of the action dataset, the proposed method scores lower. This makes sense because the proposed method is developed for action images of five classes and not for scene images of 10 classes. Interestingly, comparing the results on our dataset and the STD dataset, one can see high variations in classification rate across methods on our dataset, while on the STD dataset the classification rates of the different methods are more similar. It can be inferred that variations in action images are less predictable than in scene images. However, in terms of overall classification rate, the proposed method outperforms all the other methods.

Fig. 7
figure 7

Examples of successful classification of the proposed approach on STD dataset. Original Source: [8]

Table 3 Confusion matrix of the proposed approach on STD dataset (Average Classification Rate: 77.89%)
Table 4 Confusion matrix of the Roy et al. method [8] on STD dataset (ACR: 76.0%)
Table 5 Confusion matrix of the GOOGLE API method [35] on STD dataset (ACR: 71.7%)
Table 6 Confusion matrix of Noisy Student [38] on STD dataset (ACR: 72.6%)
Table 7 Confusion matrix of Vision Transformer (ViT) [39] on STD dataset (ACR: 73.4%)
Table 8 Confusion matrix of the Qin et al. technique [40] on STD dataset (ACR: 70.2%)

To test that the proposed approach does not depend on text in the images for classifying action images, we conducted experiments on the benchmark Stanford40 Actions dataset [47], which does not contain text information. Qualitative results of the proposed approach on the Stanford40 Actions dataset are shown in Fig. 8, where one can see that the proposed method classifies different action images successfully. Quantitative results of the proposed and existing methods are reported in Table 9, where we list classification rates for each class. Even though the number of classes increases from 5 to 40 and the images of the different classes do not contain text, the proposed method achieves the best classification rate, that is, AoA (Average of Averages), compared to the existing methods.

Fig. 8
figure 8

Examples of successful classification of the proposed approach on Stanford 40 Actions dataset. Original Source: [47]

Table 9 Average Classification Rate of the Proposed and Existing Approach on Stanford 40 Actions Dataset

This is due to the way the proposed method trains the three different deep nets at different levels. Therefore, we can infer that the proposed approach is generic and capable of classifying different classes irrespective of the number of classes and their content. The existing methods, on the other hand, perform poorly because of their inherent limitations. It is noted from the experiments on the 5-class, 10-class and 40-class datasets that the proposed method scores highest for the 5-class dataset, slightly lower for the 10-class dataset, and lower still for the 40-class dataset. This is expected because, as the number of classes increases, the complexity of the problem also increases. Nonetheless, in terms of overall classification rate, the proposed method outperforms the others.

4.4 Validating the proposed classification using text detection and recognition experiments

The experiments in the previous sub-sections show that the proposed classification method is effective, scalable and generic. In order to show the advantage of the proposed classification, we conduct text detection and recognition experiments on the dataset containing 5 classes (our dataset) and on the dataset containing 10 classes (the STD dataset). No text detection and recognition experiments are conducted on the Stanford40 Actions dataset because it does not provide text information. Experiments before and after classification are carried out to assess the advantage of our classification method. In the before-classification experiments, all the images, without separating classes, are given as the input for text detection and recognition, and the numbers of training and testing samples are determined from the total number of images over all classes. In the after-classification experiments, the text detection and recognition methods consider each class separately when computing the measures, because the proposed classification technique makes this possible. Therefore, each class can be treated as a separate problem, rather than all classes together, to enhance the performance of text detection and recognition after classification. For the text detection experiments we use the EAST method [9] and Zhu and Du [22], and for the recognition experiments we use the E2E-MLT method [24] and Lin et al. [29].

Qualitative results of the text detection technique [9] before classification (images of all the classes are input) and after classification (images of individual classes are input) for our dataset and STD are shown in Figs. 9 and 10, respectively, where it can be seen that the method provides better results after classification than before classification for almost all the classes. This is expected because the classifier used in the text detection method [9] is trained, and its parameters are tuned, according to the complexity of the classes. However, the text detection method may occasionally not achieve better results after classification; this is due to the limitations of the text detection method and the lack of relevant training samples. The same conclusion can be drawn from the qualitative results of the recognition method [24] on our dataset and the STD dataset, shown in Figs. 11 and 12, respectively, where the recognition results after classification are better than before classification for almost all the classes. In this way, the proposed classification helps to improve text detection and recognition performance for classes of different complexities. In this study, we use the existing text detection method [9] and recognition method [29] to show that the results after classification improve compared to those before classification; based on our experiments, the same inferences can be drawn when other existing methods are used for comparison.

Fig. 9
figure 9

Sample results of the text detection [9] on our dataset before and after classification. Original Source: [8, 42]

Fig. 10
figure 10

Qualitative results of the text detection [9] on STD dataset before and after classification. Original Source: [8, 42]

Fig. 11
figure 11

Qualitative results of the recognition method on our dataset before and after classification. GT denotes ground truth; "$" denotes that the recognition method returns nothing

Fig. 12
figure 12

Qualitative results of the recognition method on STD dataset before and after classification. GT denotes ground truth; "$" denotes that the recognition method returns nothing

Quantitative results of the text detection methods [9, 22] before and after classification on our dataset and the STD dataset are reported in Table 10, where it can be noted that for the Yoga class of our dataset there is a significant improvement in Recall, Precision and F-score after classification. For the other classes, we do not see much improvement over before classification. This is because the training samples used may not represent the variations of those classes, especially for the Concert class. From these experiments, one can understand that the text detection method is good for the Yoga class but not for the other classes. The reason is that images of the Yoga class usually contain caption text (which is edited text) rather than scene text (natural text), unlike the other classes. Since caption text is edited, it has good clarity and visibility, while scene text does not. This leaves open the option of choosing an appropriate text detection method, or developing a new one, to achieve the best results for such deprived classes, which is an advantage of the proposed classification. For the STD dataset, the text detection method gives better Recall for almost all the classes after classification than before classification, so there is a significant difference in Recall between before and after classification.

Table 10 Performance of text detection methods (EAST and TextMountain) before and after classification on our and STD datasets

Quantitative results of the recognition methods [24, 29] on our dataset and the STD dataset, reported in Table 11, show large improvements in recognition rate after classification compared to before classification for almost all the classes of both datasets. However, for the e-Learning and Economics classes, the recognition methods show poorer results than before classification. This is explicable because the text in those classes suffers from very small fonts and low contrast, which demands more, and more relevant, samples for training the parameters of the classifier. Therefore, to achieve better results for those two classes, different methods that can cope with the challenges of the e-Learning and Economics classes could be chosen.

Table 11 Performance of text recognition methods E2E-MLT [24] and STAN [29] on our and STD datasets before and after classification

Sometimes, when images share a similar background or foreground, as shown in Fig. 13, the proposed approach misclassifies them. This is a limitation of the proposed model and leaves room for improvement in the future. In such situations, rather than using only a few modalities, such as text and face, it is necessary to capture the relationship between foreground and background by adding more modalities, such as video information, which will be considered in the near future.

Fig. 13
figure 13

Samples of unsuccessful classification results of the proposed approach on different datasets. Original Source: [42] and [8]

5 Conclusion and remarks

We have proposed a new idea for the classification of action images to support the detection and recognition of embedded text. The proposed work combines three deep nets, namely ResNet50 for overall pixel-distribution-based classification, VGG16 for MSER components, and VGG16 for face components, to achieve enhanced classification performance. The proposed method exploits the content of images at both the pixel and component levels to tackle the complex classification problem at hand. Experimental results on our dataset of action images and on two standard datasets show that the proposed method is effective and scalable, and that it outperforms existing methods in terms of classification rate on all three datasets. To show the significance of the proposed classification technique, text detection and recognition experiments before and after classification were conducted. Our future work will be dedicated to solving further challenges by combining features of text, image content and video.