A feature selection framework for video semantic recognition via integrated cross-media analysis and embedded learning
Abstract
Video data are usually represented by high-dimensional features. The performance of video semantic recognition, however, can deteriorate because of irrelevant and redundant components in these high-dimensional representations. To improve recognition performance, we propose a new feature selection framework and validate it on video semantic recognition tasks. The framework addresses two issues. First, while labeled videos are scarce, relevant labeled images are abundant and readily available on the Web. We therefore propose a supervised transfer learning component that performs cross-media analysis, in which discriminative features are selected by evaluating each feature's correlation with the classes of videos and relevant images. Second, labeled videos are normally rare in real-world applications. We therefore add an unsupervised subspace learning component that retains the most valuable information and eliminates feature redundancy by leveraging both labeled and unlabeled videos. Cross-media analysis and embedded learning are learned simultaneously in a joint framework, which lets the algorithm use the knowledge they share as supplementary information to facilitate decision making. We propose an efficient iterative algorithm to optimize the resulting learning-based feature selection problem, and its convergence is guaranteed. Experiments on different databases demonstrate the effectiveness of the proposed algorithm.
Keywords
Feature selection; Cross-media analysis; Embedded learning
Abbreviations
AA: Average accuracy
BoW: Bag-of-words
EnFS: Ensemble feature selection
FSFS: Fisher score feature selection
FSNM: Feature selection via joint ℓ_{2,1}-norm minimization
JCAEL: Jointing cross-media analysis and embedded learning
JELSR: Joint embedding learning and sparse regression
LMCSVM: Linear multi-class SVM
LSR: Least square regression
LSR_{21}: ℓ_{2,1}-norm least square regression
MCkNN: Multi-class kNN
PCA: Principal component analysis
SIFT: Scale-invariant feature transform
STIP: Space-time interest points
SVM_{21}: Multi-class ℓ_{2,1}-norm support vector machine
1 Introduction
Video semantic recognition [1] is a fundamental research problem in computer vision [2, 3] and multimedia analysis [4, 5]. However, video data are typically represented by high-dimensional feature vectors [6], which incur high computational costs. Irrelevant and redundant features may also degrade the performance of video semantic recognition. Feature selection [7] can reduce the redundancy and noise in the original feature representation, thus facilitating subsequent analysis tasks such as video semantic recognition.
Depending on whether class label information is available, feature selection algorithms can be roughly divided into two groups [8]: supervised feature selection [9] and unsupervised feature selection [10]. Supervised feature selection selects discriminative features by evaluating the features' correlation with the classes, and by using label information it usually yields better and more reliable performance. However, most supervised feature selection methods require sufficient labeled training data to learn a reliable model [11]. Since high-quality labeled training data are difficult to collect in real-world applications [12], it is normally impractical to provide enough labeled videos for existing supervised methods to perform satisfactorily. Recently, some cross-media analysis methods [13, 14] have been proposed to address the shortage of labeled videos by transferring knowledge from other relevant types of media (e.g., images). This type of cross-media analysis can therefore be regarded as a kind of transfer learning. Moreover, relevant labeled images are available and easier to collect, and they can be leveraged to enhance feature selection for video semantic recognition. To this end, we propose a supervised transfer learning component in our framework, in which knowledge from images is adapted to improve feature selection for video semantic recognition. Specifically, we use available images with relevant semantics as the auxiliary resource, and feature selection is performed on the target videos. To transfer information from images to videos, we use the same type of still features to represent both videos and images.
Unsupervised feature selection exploits data variance and separability to evaluate feature relevance without labels. A frequently used criterion is to select the features that best preserve the data distribution or local structure derived from the whole feature set [15]. Recently, some unsupervised feature selection methods based on embedded learning have been proposed. The main advantage of embedded learning is that it can use the manifold structure of both labeled and unlabeled data to enhance feature selection. Further, most transfer learning algorithms require the features extracted from the source domain to be of the same type as those in the target domain. In practice, the videos and images in transfer learning usually need to be represented by still features such as SIFT [16]. For example, many video representations are key-frame-based, so they cannot use motion features such as STIP [17], and the underlying temporal information is lost. To represent the video semantics completely and to use the unlabeled videos effectively, we add unsupervised embedded learning to our framework, based on augmented feature representations. To take full advantage of cross-media analysis and embedded learning, we assemble them into a joint optimization framework by introducing a joint ℓ_{2,1}-norm regularization [18]. In this way, the information from cross-media analysis and embedded learning can be transferred from one domain to the other. Moreover, overfitting can be alleviated, and thus the performance of feature selection can be improved. We call the proposed feature selection framework jointing cross-media analysis and embedded learning (JCAEL). The main contributions of this paper are summarized as follows:
(1) Because JCAEL transfers knowledge learned from relevant images to videos to improve video feature selection, it can directly use labeled images to address the problem of insufficient label information. This allows our method to uncover the common discriminative features in videos and images of the same class, which gives better interpretability of the features.
(2) Our method contains unsupervised embedded learning, which utilizes both labeled and unlabeled videos for feature selection. JCAEL can therefore exploit the variance and separability of all training videos to find the common irrelevant or noisy features and thus generate better feature subsets. Meanwhile, videos can be represented by augmented features during embedded learning. The augmented features give a more complete representation of videos and provide the space from which to select precise features of video semantics.
(3) To take advantage of both cross-media analysis and embedded learning, we combine them by adding a joint ℓ_{2,1}-norm regularization. In this way, our algorithm evaluates the informativeness of features jointly, making use of the correlation between features. In addition, our proposed method enables cross-media analysis and embedded learning to share the common components/knowledge of features, so as to uncover common irrelevant features, which improves the performance of feature selection for video semantic recognition.
The rest of this paper is organized as follows. The proposed method and its optimization approach are presented in Section 2. The experimental results are reported in Section 3. Section 4 concludes the paper.
2 Proposed method
In this section, we present the framework of JCAEL. To construct this framework efficiently, we develop an iterative algorithm and prove its convergence.
2.1 Notations
To adapt knowledge from images to videos, we represent the labeled training videos by a still feature: \(X_{v}=\left[x_{v}^{1},x_{v}^{2},\ldots,x_{v}^{n_{l}}\right]\in R^{d_{s}\times n_{l}}\), where d_{s} is the still feature dimension and n_{l} is the number of labeled training videos. Let \(Y_{v}=\left[y_{v}^{1},y_{v}^{2},\ldots,y_{v}^{n_{l}}\right]\in \{0,1\}^{c_{v}\times n_{l}}\) be the label matrix of the labeled training videos, where c_{v} is the number of video classes. Similarly, we represent the images by a still feature: \(X_{i}=\left[x_{i}^{1},x_{i}^{2},\ldots,x_{i}^{n_{i}}\right]\in R^{d_{s}\times n_{i}}\), where n_{i} is the number of images, and \(Y_{i}=\left[y_{i}^{1},y_{i}^{2},\ldots,y_{i}^{n_{i}}\right]\in \{0,1\}^{c_{i}\times n_{i}}\) is the label matrix of the images, where c_{i} is the number of image classes. Here \(y_{v}^{kj}\) and \(y_{i}^{kj}\) denote the jth entries of \(y_{v}^{k}\) and \(y_{i}^{k}\), respectively: \(y_{v}^{kj}=1\) if \(x_{v}^{k}\) belongs to the jth class and \(y_{v}^{kj}=0\) otherwise, and likewise \(y_{i}^{kj}=1\) if \(x_{i}^{k}\) belongs to the jth class and \(y_{i}^{kj}=0\) otherwise. To fully utilize labeled and unlabeled videos, we use an augmented feature to represent all n videos: \(Z_{v}=\left[z_{v}^{1},z_{v}^{2},\ldots,z_{v}^{n}\right]\in R^{d_{a}\times n}\), where d_{a} is the dimension of the augmented feature. Following the basic idea of feature learning, we represent each original datum \(z_{v}^{j}\) by its low-dimensional embedding \(p_{v}^{j}\in R^{d_{e}}\), where d_{e} is the dimensionality of the embedding. The embedding of Z_{v} is then denoted by \(P_{v}=\left[p_{v}^{1},p_{v}^{2},\ldots,p_{v}^{n}\right]\in R^{d_{e}\times n}\).
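The column-wise layout of the data and label matrices above can be illustrated with a minimal NumPy sketch. The helper name `one_hot_columns` and the toy sizes are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def one_hot_columns(labels, num_classes):
    """Build Y in {0,1}^(num_classes x n): Y[j, k] = 1 iff sample k belongs to class j."""
    labels = np.asarray(labels)
    Y = np.zeros((num_classes, labels.size), dtype=int)
    Y[labels, np.arange(labels.size)] = 1
    return Y

# Hypothetical toy sizes: n_l = 5 labeled videos, c_v = 3 classes, d_s = 4.
rng = np.random.default_rng(0)
X_v = rng.random((4, 5))                               # d_s x n_l, one column per video
Y_v = one_hot_columns([0, 2, 1, 2, 0], num_classes=3)  # c_v x n_l
print(X_v.shape, Y_v.shape)
```

Each column of `Y_v` contains exactly one 1, in the row of the class that the corresponding column of `X_v` belongs to.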
2.2 The proposed framework of JCAEL
where loss(.) is the loss function and αΩ(f) is the regularization with α as its parameter.
where \(W_{v}\in R^{d_{s}\times c_{v}}\) is the transformation matrix of the labeled videos with respect to the still feature, ∥·∥_{F} denotes the Frobenius norm of a matrix, and α is the regularization parameter. As indicated in [1, 22], the ℓ_{2,1}-norm of W_{v} is defined as \(\left\| W_{v}\right\|_{2,1}=\sum\limits_{j=1}^{d_{s}}\sqrt{\sum\limits_{k=1}^{c_{v}}\left(W_{v}^{jk}\right)^{2}}\), where \(W_{v}^{jk}\) is the element in the jth row and kth column of W_{v}. Minimizing the ℓ_{2,1}-norm of W_{v} shrinks some rows of W_{v} to zero, which makes W_{v} particularly suitable for feature selection.
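As a small numerical illustration of the ℓ_{2,1}-norm defined above (the sum over rows of each row's Euclidean norm), the following sketch evaluates it on a toy matrix. The function name `l21_norm` and the example values are illustrative assumptions.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    W = np.asarray(W, dtype=float)
    return float(np.sqrt((W ** 2).sum(axis=1)).sum())

# Toy matrix with one all-zero row, i.e., one feature that would be discarded.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(W))  # 5 + 0 + 13 = 18.0
```

Because the penalty acts on whole rows, driving it down tends to zero out entire rows, and each row corresponds to one feature.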
where λ is the regularization parameter.
In Eq. (7), with the term ∥W∥_{2,1}, our algorithm is able to evaluate the informativeness of the features jointly for both knowledge adaptation and low dimensional embedding. Our algorithm further enables different feature selection functions to share the common components/knowledge across knowledge adaptation and low dimensional embedding. In this way, the information from knowledge adaptation and low dimensional embedding can be transferred from one domain to the other. On the other hand, ∥W∥_{2,1} enables W_{v}, W_{i}, and W_{z} to have the same sparse patterns and share the common components, which can result in an optimal W for feature selection. Since there are four parameters (i.e., W_{i}, W_{z}, P_{v}, and W_{v}) to be estimated in Eq. (7), the objective function in Eq. (7) is not jointly convex with respect to the four parameters, but it is convex with respect to one parameter when we fix the other parameters. Thus, we propose an alternating optimization algorithm [24] to solve the optimization problem of JCAEL.
2.3 Optimization
In this section, we introduce an optimization algorithm for the objective function in Eq. (7). As there exist a number of variables to be estimated, we propose an alternating optimization algorithm to solve the optimization problem in Eq. (7). Denote \(W_{v}=\left [w_{v}^{1};w_{v}^{2};\ldots w_{v}^{d_{a}}\right ]\), \(W_{i}=\left [w_{i}^{1};w_{i}^{2};\ldots w_{i}^{d_{a}}\right ]\), \(W_{z}=\left [w_{z}^{1};w_{z}^{2};\ldots w_{z}^{d_{a}}\right ],\) and \(W=\left [w^{1};w^{2};\ldots w^{d_{a}}\right ]\), where d_{a} is the number of features.
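To make the alternating idea concrete, the sketch below solves a much simpler problem than Eq. (7): a single ℓ_{2,1}-regularized least-squares model, min_W ∥X^T W − Y^T∥_F^2 + α∥W∥_{2,1}, using the standard iteratively reweighted scheme for ℓ_{2,1} problems. It is not the JCAEL algorithm. The function names, the small smoothing constant `eps`, the ridge initialization, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def smoothed_objective(X, Y, W, alpha, eps):
    """Squared loss plus alpha times an eps-smoothed l2,1 penalty."""
    residual = X.T @ W - Y.T
    row_norms = np.sqrt((W ** 2).sum(axis=1) + eps)
    return float((residual ** 2).sum() + alpha * row_norms.sum())

def l21_least_squares(X, Y, alpha=1.0, n_iter=15, eps=1e-8):
    """X: d x n features, Y: c x n one-hot labels. Returns W (d x c) and objective values."""
    d = X.shape[0]
    XXt = X @ X.T
    XYt = X @ Y.T
    W = np.linalg.solve(XXt + alpha * np.eye(d), XYt)  # ridge solution as a start
    history = [smoothed_objective(X, Y, W, alpha, eps)]
    for _ in range(n_iter):
        # Step 1: with W fixed, build the diagonal reweighting matrix D.
        D = np.diag(1.0 / (2.0 * np.sqrt((W ** 2).sum(axis=1) + eps)))
        # Step 2: with D fixed, the problem is a convex quadratic in W
        # whose minimizer solves (X X^T + alpha D) W = X Y^T.
        W = np.linalg.solve(XXt + alpha * D, XYt)
        history.append(smoothed_objective(X, Y, W, alpha, eps))
    return W, history

# Toy data: feature 0 carries the class signal, features 1 to 5 are noise.
rng = np.random.default_rng(0)
n = 60
labels = np.arange(n) % 2
Y = np.vstack([1 - labels, labels]).astype(float)  # c x n
X = rng.standard_normal((6, n))
X[0] = labels + 0.05 * rng.standard_normal(n)
W, history = l21_least_squares(X, Y, alpha=1.0)
print(np.round(np.linalg.norm(W, axis=1), 3))
```

Each pass fixes one quantity and solves for the other in closed form, which is the same pattern the alternating algorithm applies to the blocks of variables in Eq. (7). The recorded smoothed objective values do not increase from one iteration to the next.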
If A and L are fixed, the optimization problem in Eq. (21) can be solved by eigendecomposition of the matrix \(\left(L+I_{n\times n}-Z_{v}^{T}A^{-1}Z_{v}\right)\). We pick the eigenvectors corresponding to the d_{e} smallest eigenvalues.
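The eigenvector-selection step can be sketched as follows for a generic symmetric matrix `M`, which stands in for the matrix being decomposed. The function name and the random test matrix are assumptions for illustration; building the actual matrix from L, A, and Z_{v} is not shown.

```python
import numpy as np

def smallest_eigvec_embedding(M, d_e):
    """Return a (d_e x n) matrix whose rows are eigenvectors of the symmetric
    matrix M for its d_e smallest eigenvalues, plus those eigenvalues."""
    M = 0.5 * (M + M.T)                   # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, :d_e].T, eigvals[:d_e]

# Hypothetical symmetric positive semidefinite stand-in with n = 8.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
M = B @ B.T
P, vals = smallest_eigvec_embedding(M, d_e=3)
print(P.shape)  # (3, 8)
```

The rows of `P` are orthonormal, and the trace of `P M P^T` equals the sum of the selected eigenvalues, which is the smallest value attainable under that orthonormality constraint.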
Based on the above derivation, we propose an alternating algorithm to optimize the objective function in Eq. (7), which is summarized in Algorithm 1. Once W is obtained, we sort the d_{a} features according to ∥w^{k}∥_{F} (k=1,2,…,d_{a}) in descending order and select the top-ranked ones.
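The final ranking step can be sketched in a few lines. The function name `select_top_features` and the toy matrix are illustrative assumptions. For a single row, the Frobenius norm coincides with the Euclidean norm.

```python
import numpy as np

def select_top_features(W, k):
    """Rank features by the norm of the corresponding row of W, largest first."""
    scores = np.linalg.norm(W, axis=1)
    order = np.argsort(-scores, kind="stable")
    return order[:k], scores

# Toy W with four features and two output columns.
W = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [1.0, 0.0],
              [0.0, 2.0]])
idx, scores = select_top_features(W, k=2)
print(idx)  # [1 3]
```

The selected indices can then be used to slice the feature matrix, for example `Z_v[idx, :]` when features are stored as rows.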
2.4 Convergence and computational complexity
2.4.1 Convergence
In this section, we theoretically show that Algorithm 1 proposed in this paper converges. We begin with the following lemma [22].
Lemma 1
As a result, the second lemma can be derived as described below.
Lemma 2
By fixing W_{i} and W_{v}, we obtain the global solutions for W_{z} and P_{v} in Eq. (7). Likewise, by fixing W_{i}, W_{z}, and P_{v}, we obtain the global solution for W_{v} in Eq. (7). In the same manner, by fixing W_{v}, W_{z}, and P_{v}, we obtain the global solution for W_{i} in Eq. (7).
Proof
When W_{i} and W_{v} are fixed, the optimization problem in Eq. (7) is equivalent to the problems described in Eq. (17) and Eq. (21). We can solve the convex optimization problem with respect to W_{z} by setting the derivative of Eq. (17) to zero, and we can derive the global solution for P_{v} by solving the eigendecomposition problem with respect to P_{v}. When W_{z}, P_{v}, and W_{i} are fixed, the optimization problem in Eq. (7) is equivalent to the problem described in Eq. (9). We can solve this convex optimization problem with respect to W_{v} by setting the derivative of Eq. (9) to zero. Thus, we obtain the global solution for W_{v} in Eq. (7), provided that W_{z}, P_{v}, and W_{i} are fixed. Similarly, we obtain the global solution for W_{i} when W_{v}, W_{z}, and P_{v} are fixed. □
Theorem 1
The proposed algorithm monotonically decreases the objective function value of Eq. (7) in each iteration. Next, we prove Theorem 1 as follows.
Proof
□
This indicates that, with the updating rule in the proposed algorithm, the objective function value for Eq. (7) monotonically decreases until a convergence is reached.
2.4.2 Computational complexity
For the computational complexity of Algorithm 1, computing the graph Laplacian matrix L costs O(n^{2}). During training, learning W_{v}, W_{i}, and W_{z} involves inverting several matrices, the most expensive of which costs \(O\left(d_{a}^{3}\right)\). To optimize P_{v}, the most time-consuming operation is the eigendecomposition of the matrix \(ED=\left(L+I_{n\times n}-Z_{v}^{T}A^{-1}Z_{v}\right)\). Note that ED∈R^{n×n}, so the time complexity of this operation is approximately O(n^{3}). Thus, the computational complexity of JCAEL is \(\max\left\{O\left(t\times n^{3}\right),O\left(t\times d_{a}^{3}\right)\right\}\), where t is the number of iterations required for convergence. In our experiments, the algorithm converges within 10 to 15 iterations, which indicates that the proposed algorithm is efficient for feature selection in video semantic recognition.
3 Experimental results and discussion
In this section, we present video semantic recognition experiments that evaluate the performance of our jointing cross-media analysis and embedded learning (JCAEL) method for feature selection.
3.1 Experimental datasets
To evaluate the contribution of cross-media analysis, we construct three pairs of video and image datasets: HMDB13 (video dataset) ← Extensive Images Databases (EID, image dataset), UCF10 (video dataset) ← Actions Images Databases (AID, image dataset), and UCF (video dataset) ← PPMI4 (image dataset), where "←" denotes the direction of adaptation from images to videos. In HMDB13 ← EID and UCF10 ← AID, the videos and images have the same semantic classes, whereas in UCF ← PPMI4 the videos and images have different semantic classes.
3.1.1 HMDB13 ←EID
Table 1 The classes of HMDB13 ← EID

| Datasets | HMDB13 (video dataset) | EID (image dataset) |
|---|---|---|
| The overlapping classes | Catch | Catching (Still DB) |
| | Clap | Applauding (Stanford40) |
| | Drink | Drinking (Stanford40) |
| | Jump | Jumping (Stanford40) |
| | Pour | Pouring liquid (Stanford40) |
| | Pushing | Pushing a cart (Stanford40) |
| | Run | Running (Stanford40) |
| | Smoke | Smoking (Stanford40) |
| | Wave | Waving hands (Stanford40) |
| | Kick | Kicking (Still DB) |
| | Throw | Throwing (Still DB) |
| | Walk | Walk (Still DB) |
| | Climbing | Climbing (Stanford40) |
3.1.2 UCF10 ←AID
Table 2 The classes of UCF10 ← AID

| Datasets | UCF10 (video dataset) | AID (image dataset) |
|---|---|---|
| The overlapping classes | Biking | Riding bike (willow-actions) |
| | Cricket bowling | Cricket bowling (action DB) |
| | Cricket shot | Cricket batting (action DB) |
| | Horse riding | Riding horse (willow-actions) |
| | Playing cello | Playing cello (PPMI) |
| | Playing flute | Playing flute (PPMI) |
| | Playing violin | Playing violin (PPMI) |
| | Tennis swing | Tennis forehand (action DB) |
| | Volleyball spiking | Volleyball smash (action DB) |
| | Baseball pitch | Throwing (still DB) |
3.1.3 UCF ←PPMI4
Table 3 The classes of UCF ← PPMI4

| Datasets | UCF (video dataset) | PPMI4 (image dataset) |
|---|---|---|
| The overlapping classes | Playing cello | Playing cello (PPMI) |
| | Playing flute | Playing flute (PPMI) |
| | Playing guitar | Playing guitar (PPMI) |
| | Playing violin | Playing violin (PPMI) |
| | Rock climbing indoor | NULL |
| | Rowing | NULL |
| | Tennis swing | NULL |
| | Volleyball spiking | NULL |
| | Walking with dog | NULL |
| | Writing on board | NULL |
3.2 Experiment setup
For all the datasets, we select 30 images from each overlapping category for knowledge adaptation, as the number of images is relatively small. We sample videos as labeled training data and take the remaining videos as the testing data. To evaluate the contribution of unsupervised subspace learning, we study how the performance varies when only a few labeled training samples are provided, with the ratio of labeled video data set to 5%. For each dataset, we repeat the sampling 10 times and report the average results. We extract SIFT features [16, 34] from the key frames of videos and from images, and we extract STIP features [17] from videos. We use the standard bag-of-words (BoW) method [35, 36] to generate the BoW representations of the SIFT and STIP features, with the number of visual words set to 600. For videos, we obtain a still feature with 600 dimensions and an augmented feature with 1200 dimensions. For images, we obtain a still feature with 600 dimensions.
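The feature dimensions above can be illustrated with a minimal bag-of-words sketch. Several things here are assumptions rather than details from the paper: the codebooks are random stand-ins for learned visual vocabularies, the descriptor dimensions (128 and 162) are hypothetical, histograms are L1-normalized, and the 1200-dimensional augmented feature is taken to be the concatenation of the SIFT and STIP histograms.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest visual word and return an
    L1-normalized histogram with one bin per word."""
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + (codebook ** 2).sum(axis=1)[None, :])  # squared distances, m x k
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
sift_codebook = rng.standard_normal((600, 128))  # hypothetical 600 visual words
stip_codebook = rng.standard_normal((600, 162))  # hypothetical 600 visual words
sift_desc = rng.standard_normal((50, 128))       # descriptors from the key frames
stip_desc = rng.standard_normal((30, 162))       # descriptors from the video clip

still = bow_histogram(sift_desc, sift_codebook)  # 600-dimensional still feature
augmented = np.concatenate([still, bow_histogram(stip_desc, stip_codebook)])
print(still.shape, augmented.shape)  # (600,) (1200,)
```

Under these assumptions, an image only has the 600-dimensional still histogram, which is why knowledge transfer between images and videos operates on the still feature alone.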
3.3 Comparison algorithms

Full features (FF): adopts all the features for classification. It is used as the baseline method in this paper.

Fisher score feature selection (FSFS) [37]: a supervised feature selection method that relies on fully labeled training data to select the features with the best discriminating ability.

Feature selection via joint ℓ_{2,1}-norm minimization (FSNM) [22]: a supervised feature selection method that applies joint ℓ_{2,1}-norm minimization to both the loss function and the regularization to perform feature selection across all data points.

ℓ_{2,1}-norm least square regression (LSR_{21}) [22]: a supervised feature selection method built upon least square regression with the ℓ_{2,1}-norm as the regularization term.

Multi-class ℓ_{2,1}-norm support vector machine (SVM_{21}) [20]: a supervised feature selection method built upon SVM with the ℓ_{2,1}-norm as the regularization term.

Ensemble feature selection (EnFS) [25]: a supervised feature selection method based on transfer learning, which transfers the shared information between different classifiers by adding a joint ℓ_{2,1}-norm on multiple feature selection matrices.

Joint embedding learning and sparse regression (JELSR) [26]: an unsupervised feature selection method that uses local linear approximation weights and ℓ_{2,1}-norm regularization.

Jointing cross-media analysis and embedded learning (JCAEL): our proposed method, which performs feature selection by adapting knowledge from images based on the still feature and by utilizing both labeled and unlabeled videos based on the augmented feature.
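Performance is measured by the average accuracy (AA) over classes, \(AA=\frac{1}{c_{v}}\sum\limits_{k=1}^{c_{v}}acc_{k}\), where c_{v} is the number of action classes and acc_{k} is the accuracy for the kth class.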
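The evaluation metric can be sketched as follows, taking the average accuracy to be the unweighted mean of the per-class accuracies, which is an assumption about the exact formula. The function name and the toy labels are illustrative, and skipping classes that have no test samples is a design choice of this sketch.

```python
import numpy as np

def average_accuracy(y_true, y_pred, num_classes):
    """Mean over classes of the fraction of that class's samples predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for k in range(num_classes):
        mask = y_true == k
        if mask.any():  # skip classes that have no test samples
            per_class.append(float((y_pred[mask] == k).mean()))
    return float(np.mean(per_class))

# Toy example: class 0 has accuracy 3/4 and class 1 has accuracy 1/2.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(average_accuracy(y_true, y_pred, num_classes=2))  # 0.625
```

Unlike overall accuracy (4/6 in this toy example), this measure weights every class equally regardless of how many test samples it has.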
3.4 Experimental results
Table 4 Comparison of feature selection algorithms on HMDB13 ← EID in terms of average accuracy using three classifiers when the ratio of labeled video data is set to 5%

| Classifiers | FF | FSFS | FSNM | LSR_{21} | SVM_{21} | EnFS | JELSR | JCAEL |
|---|---|---|---|---|---|---|---|---|
| LMCSVM | 0.3032 | 0.3187 | 0.3137 | 0.3117 | 0.3137 | 0.3182 | 0.3387 | 0.3526 |
| LSR | 0.1763 | 0.2138 | 0.1898 | 0.1903 | 0.1873 | 0.2003 | 0.2233 | 0.2318 |
| MCkNN | 0.1898 | 0.2647 | 0.2517 | 0.2582 | 0.2093 | 0.2697 | 0.2737 | 0.2877 |
Table 5 Comparison of feature selection algorithms on UCF10 ← AID in terms of average accuracy using three classifiers when the ratio of labeled video data is set to 5%

| Classifiers | FF | FSFS | FSNM | LSR_{21} | SVM_{21} | EnFS | JELSR | JCAEL |
|---|---|---|---|---|---|---|---|---|
| LMCSVM | 0.4340 | 0.4448 | 0.4715 | 0.4355 | 0.4340 | 0.4348 | 0.5061 | 0.5299 |
| LSR | 0.2906 | 0.3136 | 0.3043 | 0.3057 | 0.3028 | 0.3064 | 0.3180 | 0.3302 |
| MCkNN | 0.3360 | 0.3684 | 0.3670 | 0.3360 | 0.3360 | 0.3381 | 0.3756 | 0.4001 |
(1) The results of the feature selection algorithms are generally better than those of full features (FF). Since reducing the number of features also makes classification much faster, feature selection is all the more valuable in practical applications.
(2) As the number of labeled training videos increases, the performance of all methods improves. This is consistent with the general principle that more training information yields better models.
(3) Classification with multi-class SVM and multi-class kNN achieves better performance than least square regression when the ratio of labeled video data is set to 5%. The main reason is that the threshold learned from a small training set biases the quantization of continuous label prediction scores.
(4) When the ratio of labeled video data is set to 5%, JELSR is generally the second most competitive algorithm. This indicates that incorporating the additional information contained in the unlabeled training data through unsupervised embedded learning is indeed useful.
(5) As shown in Figs. 1–2, the supervised method based on transfer learning (EnFS) consistently outperforms the other compared methods when enough labeled training videos are available (e.g., when the ratio of labeled video data is set to 40%), since EnFS can uncover common irrelevant features by transferring the related information between different classifiers.
(6) As shown in Figs. 1–2, our proposed JCAEL remains the best performing algorithm across the different methods and settings. The main reason is that our method takes advantage of both transfer learning and embedded learning. We can also see from Tables 4–5 that JCAEL achieves the best results when only a small number of labeled training videos are available. This advantage is especially desirable for real-world problems, since precisely annotated videos are often rare.
3.5 Experiment on convergence
3.6 Experiment on parameter sensitivity
3.7 Experiment on selected features
3.8 Experiment on embedding features
3.9 Influence of crossmedia analysis and embedded learning
To further investigate the effectiveness of integrating cross-media analysis and embedded learning, we construct two new algorithms: (1) Embedded learning part (ELP), which is the unsupervised embedded learning part of JCAEL (i.e., Eq. (6)). ELP utilizes both labeled and unlabeled videos as the training dataset and represents each video by the augmented feature. (2) Cross-media analysis part (CAP), which is the transfer learning part of JCAEL (i.e., Eq. (4)). CAP transfers the knowledge from images to labeled videos and uses only the still feature to do so.
Table 6 Comparison of feature selection algorithms on UCF ← PPMI4 in terms of average accuracy using three classifiers when the ratio of labeled video data is set to 5%

| Classifiers | ELP | CAP | JCAEL |
|---|---|---|---|
| LMCSVM | 0.5537 | 0.5456 | 0.5804 |
| LSR | 0.3958 | 0.3944 | 0.4085 |
| MCkNN | 0.4085 | 0.3988 | 0.4433 |
From the results presented in Table 6 and Fig. 7, we can make the following observations: (1) Among the different methods and labeled ratios, JCAEL performs best. It achieves the highest accuracy in most cases, especially when only a few labeled training videos are provided. This is mainly because (i) JCAEL benefits from unsupervised embedded learning, which can utilize both labeled and unlabeled data, (ii) JCAEL leverages the knowledge from images to boost its performance, and (iii) JCAEL integrates transfer learning and embedded learning into a joint optimization framework, so that the gains from each are reinforced. (2) The performance of JCAEL is generally better than that of ELP for all the labeled ratios, indicating that JCAEL is able to use the extra knowledge from images to achieve higher accuracy. (3) JCAEL generally outperforms CAP, indicating that it is beneficial to utilize unlabeled videos for video semantic recognition, especially when the amount of labeled data is insufficient.
4 Conclusions
Labeled images and unlabeled videos are plentiful in the real world. To achieve good performance in video semantic recognition, we propose a new feature selection framework that borrows knowledge transferred from images to improve its performance. Meanwhile, it utilizes both labeled and unlabeled videos to enhance semantic recognition in videos. Extensive experiments validate that the knowledge transferred from images and the information contained in unlabeled videos can indeed be used to select more discriminative features, leading to higher recognition accuracy for video semantics. In comparison with existing state-of-the-art methods, the experimental results show that the proposed JCAEL performs better in video semantic recognition. Even when only a few labeled training videos are available, JCAEL remains competitive among all the compared state-of-the-art methods, which gives it a high level of flexibility for real-world applications.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.