Personalized Video Summarization Based Exclusively on User Preferences
We propose a recommender system that detects personalized video summaries, making visual content interesting under the subjective criteria of each user. To provide accurate video summarization, the video segmentation derived from user annotations and the duration feature of the video segments are combined using a Synthetic Coordinate based Recommendation (SCoR) system.
Keywords: Recommender system · Video summarization
Video summarization can be cast as an application of recommender systems [9, 13], which generally aim at providing users with targeted information about items that might interest them. Recommender systems are also used to suggest various entities such as e-shop items, web pages, news articles, movies, music, hotels, television shows, books, restaurants, friends, etc.
In this work, we study the problem of personalized video summarization without a priori knowledge of the video categories. To the best of our knowledge, this is the first work that solves personalized video summarization based exclusively on user preferences for a given dataset of videos. To solve this problem, we propose a video segmentation method that yields global video segments. The main contribution of this work is the proposed video segmentation method and the efficient combination of the video segments' duration attribute with the Synthetic Coordinate based Recommendation system (SCoR), without the use of complex audiovisual features.
2 Related Work
The problem of content recommendation can be described as follows. Given a set U of users, a set I of items and a set R of user ratings for items, we need to predict ratings for user-item pairs which are not in R. One of the main recommender system techniques is similarity-based Collaborative Filtering. Such algorithms are based on a similarity function that takes user preferences into account and outputs a similarity degree between pairs of users. Another important approach in recommender systems is Dimensionality Reduction. Each user or item in the system is represented by a vector. A user's vector is the set of their ratings for all items in the system (even those that the specific user has not rated). The Matrix Factorization method, which characterizes both items and users by vectors of latent factors inferred from item rating patterns, is also a Dimensionality Reduction technique. High correlation between item and user factors leads to a recommendation.
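As an illustration of the similarity-based Collaborative Filtering idea above, the following minimal sketch predicts a missing rating as a similarity-weighted average over like-minded users; the function name, data layout, and the choice of cosine similarity are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def predict_rating(ratings, target_user, target_item):
    """User-based collaborative filtering: predict a missing rating as a
    similarity-weighted average of the ratings of similar users.

    `ratings` is a dense user x item matrix with np.nan for missing entries.
    """
    target = ratings[target_user]
    sims, neighbour_ratings = [], []
    for u, row in enumerate(ratings):
        if u == target_user or np.isnan(row[target_item]):
            continue
        common = ~np.isnan(target) & ~np.isnan(row)  # items both users rated
        if common.sum() < 2:
            continue
        a, b = target[common], row[common]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0:
            continue
        sims.append(a @ b / denom)                   # cosine similarity
        neighbour_ratings.append(row[target_item])
    sims = np.asarray(sims)
    keep = sims > 0                                  # positive neighbours only
    if not keep.any():
        return np.nan                                # cold start: no usable neighbours
    return sims[keep] @ np.asarray(neighbour_ratings)[keep] / sims[keep].sum()
```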
The SCoR recommender system assigns synthetic coordinates (vectors) to users and items (nodes), but instead of using the dot product, it uses the Euclidean distance between a user and an item, so that, when the system converges, the distance between a user-item pair provides an accurate prediction of that user's preference for the item. SCoR has also been successfully applied to the distributed community detection problem and to the interactive image segmentation problem.
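The core idea of SCoR can be pictured as a spring embedding: every observed rating defines a target user-item distance, coordinates are iteratively nudged until distances match, and an unseen rating is then read off from the converged distance. The following is only a simplified sketch under assumed hyper-parameters and a linear rating-to-distance mapping, not the exact SCoR update rule:

```python
import numpy as np

def train_scor_like(ratings, n_users, n_items, dim=10, lr=0.05, epochs=50,
                    r_min=0.0, r_max=1.0, d_max=1.0):
    """Spring-like training of synthetic coordinates (simplified sketch).

    `ratings` is a list of (user, item, r) triples with r in [r_min, r_max].
    A high rating maps to a small target user-item distance.
    """
    rng = np.random.default_rng(0)
    U = rng.normal(size=(n_users, dim))   # user coordinates
    V = rng.normal(size=(n_items, dim))   # item coordinates
    for _ in range(epochs):
        for u, i, r in ratings:
            diff = U[u] - V[i]
            dist = np.linalg.norm(diff) + 1e-12
            target = d_max * (r_max - r) / (r_max - r_min)  # high r -> small distance
            step = lr * (dist - target) * diff / dist       # move along the "spring"
            U[u] -= step
            V[i] += step
    return U, V

def predict(U, V, u, i, r_min=0.0, r_max=1.0, d_max=1.0):
    """Map the converged user-item distance back to a rating."""
    dist = min(np.linalg.norm(U[u] - V[i]), d_max)
    return r_max - (r_max - r_min) * dist / d_max
```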
A video summary usually includes the most important scenes and events of a video, with the shortest possible description. Many traditional video summarization approaches, which are not personalized [8, 16], find a globally optimal representation of a given video taking into account only its audiovisual features. As the available video synopsis datasets and annotations have grown, the computer vision community has realized that the problem of video summarization can also be defined and solved separately for each user, taking their preferences into account. Thus, research on personalized video summarization has been gaining attention recently.
3 Personalized Video Summarization
In this section, the proposed personalized video summarization method is described. Figure 1 depicts the two stages of the proposed framework. In the first stage, each video is segmented into non-overlapping segments according to the preferences of the users. In the second stage, the personalized rankings of the video segments are provided.
3.1 Video Segmentation
The goal of video segmentation is to provide the candidate video segments that are included in the video summary, significantly reducing the search space of the problem from the set of frames to the set of video segments. The simplest video segmentation is to use fixed segments (e.g., of 5 s duration). Several audiovisual-based video summarization methods use shot detection or other, more complex temporal segmentation approaches [7, 19] to provide accurate (non-overlapping) video segmentation. In this work, since the audiovisual data are not taken into account, we take advantage of the user preferences in the training set to derive the video segmentation.
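A minimal sketch of how such a preference-driven segmentation could be computed, assuming (our reading of the description, not the paper's exact algorithm) that the segment boundaries \(F_v(i)\) are the sorted union of all highlight-interval endpoints annotated by the training users:

```python
def segment_video(n_frames, user_highlights):
    """Derive global, non-overlapping segments for one video from the
    highlight intervals annotated by the training users.

    `user_highlights` is a list of (start_frame, end_frame) intervals.
    Returns segments as (F_v(i-1), F_v(i)) pairs covering the whole video.
    """
    bounds = {0, n_frames}
    for start, end in user_highlights:
        bounds.update((start, end))
    cuts = sorted(bounds)
    return list(zip(cuts[:-1], cuts[1:]))

# Example: a 1000-frame video with two overlapping user highlights.
print(segment_video(1000, [(100, 400), (250, 600)]))
# -> [(0, 100), (100, 250), (250, 400), (400, 600), (600, 1000)]
```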
3.2 Video Segments' Duration
3.3 Ranking Video Segments
In the final stage of the proposed method, the video segments are ranked by combining the duration-based ranking function \(D(x_i)\) with the ranking of the video segments provided by the SCoR system.
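Since this excerpt does not spell out the combination rule (the duration-based function \(D(x_i)\) is defined in Sect. 3.2), the sketch below simply multiplies the two scores to produce the final ranking; the actual method may combine them differently:

```python
def rank_segments(scor_scores, durations, D):
    """Rank a video's segments for one user by combining the SCoR
    prediction for each segment with the duration-based score D(x_i).

    `scor_scores[i]` is the SCoR prediction for segment i (in [0, 1]) and
    `durations[i]` its length; `D` maps a duration to a score.
    NOTE: multiplying the two scores is an assumption for illustration.
    """
    combined = [s * D(x) for s, x in zip(scor_scores, durations)]
    return sorted(range(len(combined)), key=combined.__getitem__, reverse=True)
```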
To train SCoR, we collect all video segments (see Sect. 3.1) of each video v that has been summarized by user u. Let \([F_v(i-1), F_v(i)]\) be video segment i of video v; the recommendation \(R_u(i)\) of user u for this segment, which is used to train SCoR, is given by the percentage of the frames of \([F_v(i-1), F_v(i)]\) that belong to the video summary that user u provided. This means that \(R_u(i) \in [0,1]\).
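Concretely, \(R_u(i)\) is the fraction of the segment's frames covered by the user's summary. A small sketch, with the interval representation of the summary assumed for illustration:

```python
def training_recommendation(segment, summary_intervals):
    """R_u(i): the fraction of the frames of segment [F_v(i-1), F_v(i))
    that belong to the summary provided by user u.

    `summary_intervals` is the user's summary as (start, end) frame intervals.
    """
    seg_start, seg_end = segment
    covered = 0
    for s, e in summary_intervals:
        covered += max(0, min(seg_end, e) - max(seg_start, s))
    return covered / (seg_end - seg_start)   # a value in [0, 1]

# A segment of 100 frames, half of which lie in the user's summary.
print(training_recommendation((200, 300), [(250, 500)]))  # 0.5
```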
4 Experimental Results
SCOR: The variant of the proposed method that only uses the SCoR system.
\(SCOR-FIX\): The variant of the proposed method that combines SCoR with fixed-length (5 s) video segmentation.
RANDOM: Random summaries based on the proposed video segmentation.
To obtain personalized video highlight data, we used the large-scale dataset of del Molino and Gygli [6], which contains 13,822 users and 222,015 annotations on 119,938 YouTube videos. Because our method is based only on user preferences, we keep users and videos with at least five annotations, in order to be able to provide recommendations (the cold-start problem). The resulting dataset consists of 1,822 users and 6,347 annotations on 381 videos, with 129,890 candidate video segments under the proposed variable-length video segmentation and 199,462 video segments under the fixed-length (5 s) segmentation. The dataset was randomly split into training and test sets. The test set includes annotations from 191 users on their last annotated videos (191 videos, i.e., \(50\%\) of the given videos).
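The paper states the threshold of five annotations but not the exact pruning procedure; a common approach is iterative k-core filtering, sketched here under that assumption:

```python
from collections import Counter

def k_core_filter(annotations, k=5):
    """Iteratively drop users and videos with fewer than k annotations,
    until both constraints hold simultaneously.

    `annotations` is a collection of (user_id, video_id) pairs.
    """
    ann = set(annotations)
    while True:
        user_counts = Counter(u for u, _ in ann)
        video_counts = Counter(v for _, v in ann)
        pruned = {(u, v) for u, v in ann
                  if user_counts[u] >= k and video_counts[v] >= k}
        if pruned == ann:
            return ann
        ann = pruned
```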
Comparison with the State of the Art
In addition to the variants above, we report results for two methods from the literature: \(PHD-CA + SVM-D\) and Video2GIF.
Table 1 presents the average mAP, nMSD and \(F_1\) score. The proposed method \(SCOR-D\) clearly outperforms all the remaining methods under every evaluation metric. The importance of the duration attribute and of the proposed variable-length video segmentation is verified by comparing the results of the proposed method against SCOR and \(SCOR-FIX\), respectively. The \(F_1\) score of the proposed method is \(9\%\) and \(13\%\) higher than that of SCOR and \(SCOR-FIX\), respectively. SCOR ranks second and \(SCOR-FIX\) third under every evaluation metric. Finally, it should be noted that the performances of \(PHD-CA + SVM-D\) and Video2GIF were obtained on the whole original dataset, so they are not directly comparable with the other methods.
In this work, we presented a methodology to detect personalized video highlights without taking audiovisual features into account. The proposed method efficiently uses known user preferences to derive a video segmentation, and it combines the segment duration attribute with the SCoR recommender system, yielding accurate personalized video summarization. According to our experimental results, the proposed system outperforms the other variants and methods from the literature. The proposed methodology can be extended to include rich audiovisual features, in order to provide personalized summaries even for unseen videos.
This research has been co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH - CREATE - INNOVATE (project code: T1EDK-02147).
- 2. Gorrell, G.: Generalized Hebbian algorithm for incremental singular value decomposition in natural language processing. In: EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 3–7 April 2006 (2006)
- 3. Gygli, M.: Ridiculously fast shot boundary detection with fully convolutional neural networks. In: 2018 International Conference on Content-Based Multimedia Indexing (CBMI), pp. 1–4. IEEE (2018)
- 4. Gygli, M., Song, Y., Cao, L.: Video2GIF: automatic generation of animated GIFs from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1001–1009 (2016)
- 6. del Molino, A.G., Gygli, M.: PHD-GIFs: personalized highlight detection for automatic GIF creation. arXiv preprint arXiv:1804.06604 (2018)
- 9. Panagiotakis, C., Papadakis, H., Fragopoulou, P.: Detection of hurriedly created abnormal profiles in recommender systems. In: International Conference on Intelligent Systems (2018)
- 15. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
- 17. Vasudevan, A.B., Gygli, M., Volokitin, A., Van Gool, L.: Query-adaptive video summarization via quality-aware relevance estimation. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 582–590. ACM (2017)
- 18. Xu, J., Mukherjee, L., Li, Y., Warner, J., Rehg, J.M., Singh, V.: Gaze-enabled egocentric video summarization via constrained submodular maximization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2235–2244 (2015)
- 19. Yao, T., Mei, T., Rui, Y.: Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 982–990 (2016)