
1 Introduction

The past decade has witnessed a dramatic increase in online videos, including online TV episodes, online movies, user-generated content, livecasts, and so on. The traffic generated by world-leading online video websites (e.g., YouTube, Netflix, Tencent Video, Hulu) has come to dominate the Internet backbone. In daily life, users watch online videos for learning, news, and entertainment, and they tend to comment on a video after watching it. In recent years, a new type of video comment has emerged, called Time-Sync Comments (also known as Danmu or bullet-screen comments), which allows users to comment on video shots in real time. Time-sync comments fly across the screen, and people who are watching the same video also see the flying comments. To date, the time-sync comment service has been provided by quite a few online video websites, such as YouTube, Twitch, AcFun, BiliBili, NicoNico, and so on. In Fig. 1, we show an example of a video clip with time-sync comments.

Fig. 1. Example of a crowdsourced time-sync video.

Different from conventional video comments, time-sync comments are synchronized with a video’s playback time. This synchronization makes it possible for viewers watching the same video to share their watching experience and interact with each other. Latent features can be extracted from time-sync comments to provide more detailed information on user interests. For instance, viewers who wrote comments at nearby playback positions are likely to share some kind of similarity or association (e.g., liking or disliking a specific video shot). Intuitively, those viewers can be categorized into the same group with implicitly similar preferences. Moreover, a continuous bundle of time-sync comments can describe video content to some extent. Such information is useful for video recommendation and can further improve user experience.

In this paper, we propose a new video recommendation algorithm for crowdsourced time-sync videos, called SACF (Semantic-Aware Collaborative Filtering). The basic idea of SACF is to exploit the temporal relationship between time-sync comments and video frames, and to extract latent semantic representations of time-sync comments to provide more accurate video recommendation. Our proposed algorithm can model user preferences at the frame level. In summary, our main contributions are as follows:

  • We propose a novel video recommendation algorithm called SACF to improve recommendation performance for crowdsourced time-sync videos. Our algorithm extends traditional video recommendation algorithms by embedding latent semantic representations extracted from TSCs.

  • To better utilize interaction patterns, we integrate all the representations with a multi-layer perceptron (MLP) model. By embedding the extra semantic-aware information, our approach can readily identify users and items with similar interests and mitigate the cold-start problem.

  • We also validate our proposed algorithm on a real TSC dataset collected from the BiliBili video website. The experimental results show that our algorithm significantly outperforms the baselines, by up to 9.73% in HR@10 and 5.72% in NDCG@10.

The rest of the paper is organized as follows: we first review related work in Sect. 2 and describe the details of our algorithm in Sect. 3. The dataset and experiments are presented in Sect. 4. Finally, we conclude the paper and discuss future work in Sect. 5.

2 Related Work

Recommender systems have been extensively studied in the past years. He et al. [8], Koren et al. [13], and Mnih and Salakhutdinov [19] have shown the excellent ability of matrix factorization models for the rating prediction problem. In addition to this classic research question, Top-K recommendation using implicit feedback is also worthy of attention, with Hu et al. [9] and Rendle et al. [21] as representative works. Moreover, in recent years, neural network models [29] have gained widespread attention because of their ability to easily fit multi-dimensional features and learn nonlinear relationships, as demonstrated by Covington et al. [4] and He et al. [7].

With the development of this field, incorporating contextual information to improve user preference modeling has attracted major research interest, such as the work of Adomavicius and Tuzhilin [1] and Verbert et al. [25]. User reviews are among the most effective contextual information for modeling user preferences, and review-based algorithms are receiving more and more research attention. Tang et al. [23] incorporate user- and product-level information for document sentiment classification, while Tang et al. [24] leverage reviews for user modeling and predict the rating of a user review. In addition, many review-based models focus on enhancing the effectiveness of rating prediction, such as Ganu et al. [5] and McAuley and Leskovec [17]. To achieve better overall recommendation performance, Liu et al. [15], Tan et al. [22], and Wu and Ester [27] extract user opinions from review text and combine this information with conventional models for higher recommendation accuracy. Besides, Zhang et al. [30] integrate traditional matrix factorization with the word2vec model proposed by Mikolov et al. [18] for more precise modeling. Notably, these models mainly focus on the rating prediction problem and learn from traditional reviews, which are longer than time-sync comments and contain richer content and semantics.

Time-synchronized comments were first introduced by Wu et al. [26] for video shot tagging. Recently, as an emerging type of user-generated comment, TSCs have shown many practical properties for describing a video at the frame level. As mentioned by He et al. [6], the herding effect and multiple-burst phenomena make TSCs significantly different from traditional reviews, and that work also shows that TSCs are strongly correlated with video frame content and user reactions. Thus, this emerging comment type has been used for video highlight extraction and content annotation, as presented by Xian et al. [28] and Ikeda et al. [10]. Besides, an increasing number of models try to label videos based on TSCs, with representative work by Lv et al. [16]. Chen et al. [3] and Ping [20] further leverage TSCs to extract features of video frames and apply them to key frame recommendation. However, these methods focus on the video shot itself and do not consider the co-occurrence among TSCs.

Compared with the aforementioned approaches, we propose a semantic-aware collaborative filtering algorithm, which efficiently extracts latent representations from the TSCs within videos and achieves significant performance improvements in the Top-K recommendation task.

3 Design of Semantic-Aware Video Recommendation Algorithm

By fusing latent semantic representations from video TSCs with the traditional interaction paradigm into a composite entity, we propose our semantic-aware collaborative filtering (SACF) video recommendation algorithm, which gains flexibility and non-linearity by exploiting a multi-layer perceptron as its fundamental framework. We discuss the details of our algorithm in the following subsections.

3.1 Problem Definition

We first define the problem formally. Suppose there are N users \({{\varvec{u}}} = \{u_1,u_2,\ldots ,\) \(u_N\}\), M videos \({{\varvec{i}}} = \{i_1,i_2,\ldots ,i_M\}\) and T TSCs \({{\varvec{c}}} = \{c_1,c_2,\ldots ,c_T\}\). The TSC written by user u in video i at video time t is defined as a 2-tuple \({<}u_i^t, c_i^t{>}\). Thereby, for each video \(i \in {{\varvec{i}}}\), we can obtain two different sequences, namely the TSC writer sequence \({{\varvec{s}}}_{i}^{(u)} = (u_{i}^{1},u_{i}^{2},\ldots ,u_{i}^{T_i})\) and the TSC content sequence \({{\varvec{s}}}_{i}^{(c)} = (c_{i}^{1},c_{i}^{2},\ldots ,c_{i}^{T_i})\), where \(T_i\) is the number of TSCs in video i and each element within a sequence is ordered by its video time.

Suppose the representation of user u is defined as \({{\varvec{w}}}_u\), and the representation of video i is defined as \({{\varvec{d}}}_i\). The user semantic representations \({{\varvec{W}}}=\{ {{\varvec{w}}}_u | u \in {{\varvec{u}}} \}\) and video semantic representations \({{\varvec{D}}}=\{ {{\varvec{d}}}_i | i \in {{\varvec{i}}} \}\) are learned from the sequence data \({{\varvec{s}}}_{i}^{(u)}\) and \({{\varvec{s}}}_{i}^{(c)}\) using an improved word embedding technique.

We are given all the historical user-video interaction data \(\mathcal {D}=\{ \mathcal {O},\mathcal {O}^{-} \}\), where \(\mathcal {O}\) and \(\mathcal {O}^{-}\) denote the positive and negative instances, respectively. For a user u and a set of corresponding unseen videos, our goal is to find an interaction function \(f(\cdot )\) that ranks all the unseen videos according to how much s/he is likely to like each video. The top K videos most likely to be watched are the final results recommended to user u. For reference, we list the notations used throughout the algorithm in Table 1.

Table 1. Notations used in SACF algorithm
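To make the notation concrete, the following minimal Python sketch (the record fields and function names are our own illustration, not part of the algorithm) shows how the writer sequence \({{\varvec{s}}}_{i}^{(u)}\) and the content sequence \({{\varvec{s}}}_{i}^{(c)}\) could be assembled from raw TSC records ordered by video time.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TscRecord:
    """One time-sync comment: who wrote what, in which video, at which playback time."""
    user: str    # u
    video: str   # i
    time: float  # video playback time t (in seconds)
    text: str    # c

def build_sequences(records):
    """Group TSCs by video and order them by playback time, yielding the
    writer sequence s_i^(u) and the content sequence s_i^(c) per video."""
    by_video = defaultdict(list)
    for r in records:
        by_video[r.video].append(r)
    writer_seqs, content_seqs = {}, {}
    for video, recs in by_video.items():
        recs.sort(key=lambda r: r.time)
        writer_seqs[video] = [r.user for r in recs]
        content_seqs[video] = [r.text for r in recs]
    return writer_seqs, content_seqs
```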

3.2 Latent Semantic Representation

In this section, we explain the methods for capturing latent semantic representations, which aim to extract the latent similarity among users and among videos. Before digging into the details, to better model Internet slang, we need to conduct some data preprocessing:

  • Different character components within a TSC may carry different meanings. Thus, we split a complete TSC into multiple substrings, where each substring is a series of consecutive characters of the same type, and these substrings are treated as part of the TSC set \({{\varvec{c}}}\) as well. The character types are English letters, pure numbers, and Chinese characters. For instance, the TSC “awesome!!2333” is treated as two TSCs, i.e., “awesome” and “2333” (laughter).

  • Moreover, excessively long TSCs would impair the performance of the algorithm. Thereby, Chinese substrings that are too long are also segmented (the segmentation threshold in our experiments is more than five consecutive Chinese characters). A minimal preprocessing sketch is given after this list.
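The sketch below follows our reading of the two rules above; the regular expressions and the fixed-size chunking of long Chinese runs are illustrative assumptions (in practice a Chinese word segmenter could be used for the latter).

```python
import re

MAX_CHINESE_RUN = 5  # segmentation threshold from our experiments

def split_tsc(tsc):
    """Split a raw TSC into substrings of a single character type
    (English letters, digits, or Chinese characters)."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+", tsc)
    result = []
    for part in parts:
        # Overly long Chinese runs are further segmented; fixed-size chunking
        # is a placeholder here (a word segmenter could be used instead).
        if re.match(r"[\u4e00-\u9fff]", part) and len(part) > MAX_CHINESE_RUN:
            result.extend(part[i:i + MAX_CHINESE_RUN]
                          for i in range(0, len(part), MAX_CHINESE_RUN))
        else:
            result.append(part)
    return result

print(split_tsc("awesome!!2333"))  # -> ['awesome', '2333']
```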

Inspired by [14, 18], we propose improved word embedding methods to learn the representations. In this scheme, each user is mapped to a unique latent vector \({{\varvec{w}}}_{u}\), and each such vector is one column of the matrix \({{\varvec{W}}}\), the user semantic representation matrix. Given the user sequence \({{\varvec{s}}}_{i}^{(u)}\) over a finite user set \({{\varvec{u}}}\), the objective function to be maximized is formulated as follows,

$$\begin{aligned} \frac{1}{T_i} \sum _{t=k}^{T_i-k} \sum _{-k\le j\le k, j\ne 0} \log p(u_{i}^{t} | u_{i}^{t+j}) \end{aligned}$$
(1)

where k is the context window size and \(p(\cdot )\) is the softmax function,

$$\begin{aligned} p(u_{i}^{t} | u_{i}^{t+j}) = \frac{\exp ( {{{\varvec{w}}}_{u_{i}^{t}}}^T \cdot {{\varvec{w}}}_{u_{i}^{t+j}})}{\sum _{u^{'} \in {{\varvec{u}}}} \exp ( {{{\varvec{w}}}_{{u_{i}^{t+j}}}}^T \cdot {{\varvec{w}}}_{u^{'}} )} \end{aligned}$$
(2)

After training converges, users with similar watching patterns are projected to nearby positions in the vector space. We keep these user semantic representations \({{\varvec{W}}}=\{ {{\varvec{w}}}_u | u \in {{\varvec{u}}} \}\) for later use. Likewise, each TSC and each video are mapped to unique latent vectors \({{\varvec{w}}}_c\) and \({{\varvec{d}}}_i\), respectively. Each vector is a column of the corresponding matrix, \({{\varvec{W}}}_c\) or \({{\varvec{D}}}\), where \({{\varvec{D}}}\) denotes the video semantic representation matrix. We further leverage the information within \({{\varvec{W}}}\) and \({{\varvec{D}}}\) in the next subsection. Given the sequence \({{\varvec{s}}}_{i}^{(c)}\) over a finite TSC set \({{\varvec{c}}}\), the objective function is defined as,

$$\begin{aligned} \frac{1}{T_i} \sum _{t=k}^{T_i-k} \sum _{-k\le j\le k, j\ne 0} \log p(c_{i}^{t} | c_{i}^{t+j}) \end{aligned}$$
(3)

Similarly, the softmax function can be formulated as,

$$\begin{aligned} p(c_{i}^{t} | c_{i}^{t+j}) = \frac{\exp ( {{{\varvec{z}}}_{c_{i}^{t},i}}^T \cdot {{\varvec{z}}}_{c_{i}^{t+j}})}{\sum _{c^{'} \in {{\varvec{c}}}\ and\ i^{'} \in {{\varvec{i}}}} \exp ( {{{\varvec{z}}}_{{c_{i}^{t+j}},i}}^T \cdot {{\varvec{z}}}_{c^{'},i^{'}} )} \end{aligned}$$
(4)

where the latent vector \({{\varvec{z}}}\) is constructed from \({{\varvec{w}}}_c\) and \({{\varvec{d}}}_i\); in particular, it is the sum of \({{\varvec{w}}}_c\) and \({{\varvec{d}}}_i\). Our ultimate goal is to obtain the video semantic representation set \({{\varvec{D}}}=\{ {{\varvec{d}}}_i | i \in {{\varvec{i}}} \}\) for the next step.
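As a rough illustration of this pretraining stage, the sketch below uses off-the-shelf gensim models as stand-ins for the improved embeddings of Eqs. (1)–(4): a skip-gram Word2Vec over the writer sequences yields \({{\varvec{W}}}\), and a PV-DM Doc2Vec, whose document (video) vector is combined with the TSC vectors in the spirit of \({{\varvec{z}}} = {{\varvec{w}}}_c + {{\varvec{d}}}_i\), yields \({{\varvec{D}}}\). Hyperparameters and function names are assumptions, not the exact implementation.

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def learn_semantic_representations(writer_seqs, content_seqs, dim=64, window=5):
    """Learn user representations W from per-video writer sequences and
    video representations D from per-video TSC content sequences."""
    # Users: skip-gram over writer sequences, analogous to Eqs. (1)-(2).
    user_model = Word2Vec(sentences=list(writer_seqs.values()),
                          vector_size=dim, window=window, sg=1, min_count=1)
    # Videos: PV-DM, where the video tag vector is combined with the TSC
    # vectors, analogous to z = w_c + d_i in Eqs. (3)-(4).
    docs = [TaggedDocument(words=tscs, tags=[video])
            for video, tscs in content_seqs.items()]
    video_model = Doc2Vec(documents=docs, vector_size=dim,
                          window=window, dm=1, min_count=1)
    W = {u: user_model.wv[u] for u in user_model.wv.index_to_key}
    D = {video: video_model.dv[video] for video in content_seqs}
    return W, D
```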

3.3 Algorithm Design

We now elaborate the SACF algorithm in detail. Our approach employs a deep neural network, which is capable of learning non-linear interactions from the input data and can easily be embedded with extra features. Therefore, in addition to treating the identities of a user and a video, as in pure collaborative filtering, as the basic input features, SACF also transforms the video TSC corpus into two kinds of representations generated from the user and video latent semantic information. As shown in Fig. 2, on top of the input layer, each user and video is mapped to two corresponding vectors, i.e., an identification dense vector and a latent semantic representation vector. The generation of the latent semantic representations has been explained in Sect. 3.2. This 4-tuple of embedding vectors is then fed into a multi-layer neural network in which the layers are fully connected. The final output layer produces a predictive probability \(\hat{r}_{ui}\), which is trained by minimizing the binary cross-entropy loss between \(\hat{r}_{ui}\) and its target value \(r_{ui}\).

Fig. 2. SACF model structure for time-sync video recommendation.

Consequently, we can further formulate the SACF algorithm as,

$$\begin{aligned} \hat{r}_{ui} = f( {{\varvec{P}}}^T {{\varvec{v}}}_u, {{\varvec{Q}}}^T {{\varvec{v}}}_i, {{\varvec{W}}}^T {{\varvec{v}}}_u, {{\varvec{D}}}^T {{\varvec{v}}}_i \,|\, {{\varvec{P}}},{{\varvec{Q}}},{{\varvec{W}}},{{\varvec{D}}}, \varTheta ) \end{aligned}$$
(5)

where \({{\varvec{P}}}\) and \({{\varvec{Q}}}\) denote the latent factor matrices for users and videos, respectively. \({{\varvec{W}}}\) is the user semantic representation matrix, and analogously, \({{\varvec{D}}}\) is the video semantic representation matrix, which captures the latent features extracted from the TSC corpus. \({{\varvec{v}}}_u\) and \({{\varvec{v}}}_i\) denote the index (one-hot) vectors of user u and video i, respectively. \(\varTheta \) represents the parameters of the interaction function \(f(\cdot )\).

As mentioned above, the function \(f(\cdot )\) can be further defined as a multi-layer neural network,

$$\begin{aligned} f( {{\varvec{P}}}^T {{\varvec{v}}}_u, {{\varvec{Q}}}^T {{\varvec{v}}}_i, {{\varvec{W}}}^T {{\varvec{v}}}_u, {{\varvec{D}}}^T {{\varvec{v}}}_i ) = \sigma \big ( {{\varvec{h}}}^T \varPhi _L( \varPhi _{L-1}( \ldots \varPhi _1( {{\varvec{P}}}^T {{\varvec{v}}}_u, {{\varvec{Q}}}^T {{\varvec{v}}}_i, {{\varvec{W}}}^T {{\varvec{v}}}_u, {{\varvec{D}}}^T {{\varvec{v}}}_i ) \ldots ) ) \big ) \end{aligned}$$
(6)

where \(\sigma \) is the mapping function of the output layer and \(\varPhi _L\) denotes the \(L^{th}\) hidden layer of the neural network. More specifically, we formulate each layer as follows,

$$\begin{aligned} \begin{aligned} {{\varvec{x}}}_1&= \varPhi _1( {{\varvec{p}}}_u, {{\varvec{q}}}_i, {{\varvec{w}}}_u, {{\varvec{d}}}_i )= \begin{bmatrix} {{\varvec{p}}}_u&{{\varvec{q}}}_i&{{\varvec{w}}}_u&{{\varvec{d}}}_i \end{bmatrix}^{T}\\ {{\varvec{x}}}_2&= \varPhi _2({{\varvec{x}}}_{1}) = g_2({{\varvec{A}}}_2^T {{\varvec{x}}}_{1} + {{\varvec{b}}}_2)\\&\ldots \\ {{\varvec{x}}}_L&= \varPhi _L({{\varvec{x}}}_{L-1}) = g_L({{\varvec{A}}}_L^T {{\varvec{x}}}_{L-1} + {{\varvec{b}}}_L)\\ \hat{r}_{ui}&= \sigma ({{\varvec{h}}}^T {{\varvec{x}}}_L) \end{aligned} \end{aligned}$$
(7)

where \({{\varvec{A}}}_L\), \({{\varvec{b}}}_L\) and \(g_L\) respectively denote the weight matrix, bias vector and activation function of the \(L^{th}\) layer. \({{\varvec{p}}}_u\) and \({{\varvec{q}}}_i\) are the dense vectors obtained by embedding the one-hot encodings of user u and video i. \({{\varvec{w}}}_u\) and \({{\varvec{d}}}_i\) are the latent semantic representations generated from the crowdsourced TSC data. \({{\varvec{h}}}\) represents the weight vector of the output layer.
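For concreteness, a minimal PyTorch sketch of Eq. (7) is given below; the layer sizes, the ReLU activations, and the sigmoid output are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class SACF(nn.Module):
    """Sketch of Eq. (7): concatenate p_u, q_i, w_u, d_i, pass them through
    stacked fully connected layers, and output a watch probability."""
    def __init__(self, num_users, num_videos, factor_dim=8, sem_dim=64,
                 hidden_dims=(128, 64, 32, 16)):
        super().__init__()
        self.P = nn.Embedding(num_users, factor_dim)   # identification embeddings
        self.Q = nn.Embedding(num_videos, factor_dim)
        self.W = nn.Embedding(num_users, sem_dim)      # semantic representations,
        self.D = nn.Embedding(num_videos, sem_dim)     # initialized from Sect. 3.2
        layers, in_dim = [], 2 * factor_dim + 2 * sem_dim
        for out_dim in hidden_dims:                    # hidden layers Phi_2 ... Phi_L
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.mlp = nn.Sequential(*layers)
        self.h = nn.Linear(in_dim, 1)                  # output weights h

    def forward(self, users, videos):
        x1 = torch.cat([self.P(users), self.Q(videos),
                        self.W(users), self.D(videos)], dim=-1)  # Phi_1
        return torch.sigmoid(self.h(self.mlp(x1))).squeeze(-1)   # r_hat in (0, 1)
```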

To endow SACF with a probabilistic interpretation, we need to constrain \(\hat{r}_{ui}\) to the range [0, 1], which can be achieved by adopting a logistic or probit function as the activation of the output layer. We finally optimize SACF by minimizing the binary cross-entropy loss, so the objective function of SACF can be formulated as,

$$\begin{aligned} \mathcal {L}= -\sum _{(u,i) \in \mathcal {O} \cup \mathcal {O}^{-}} \left[ r_{ui} \log \hat{r}_{ui} + (1-r_{ui}) \log (1-\hat{r}_{ui}) \right] \end{aligned}$$
(8)

where \(\mathcal {O}\) denotes the set of observed interactions, and \(\mathcal {O}^{-}\) denotes the set of unobserved interactions. To improve the training efficiency, \(\mathcal {O}^{-}\) can be regarded as the negative instances sampled from all the unobserved interactions.
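A possible training step under Eq. (8) is sketched below, using the 4:1 negative-to-positive sampling ratio adopted later in Sect. 4.2; the sampling routine, variable names, and optimizer usage are our own assumptions.

```python
import random
import torch
import torch.nn.functional as F

def sample_negatives(watched, num_videos, k=4):
    """Draw k unobserved videos for one user as negative instances in O^-."""
    negatives = []
    while len(negatives) < k:
        v = random.randrange(num_videos)
        if v not in watched:
            negatives.append(v)
    return negatives

def train_step(model, optimizer, users, videos, labels):
    """One mini-batch update minimizing the binary cross-entropy of Eq. (8)."""
    optimizer.zero_grad()
    preds = model(users, videos)
    loss = F.binary_cross_entropy(preds, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```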

Algorithm 1. The SACF algorithm.

The semantic-aware neural collaborative filtering procedure is illustrated in Algorithm 1. The algorithm can be considered a two-stage process: in the first stage we pretrain the semantic representations of users and items, and in the second stage we combine them with the user-video interaction data to carry out the complete recommendation task. The algorithm works as long as a video has a series of TSC data. These data contain meta-information about users and videos, so even if a user rarely posts any TSC, the algorithm can still leverage the implicit information to estimate the probability of watching a video.

4 Performance Evaluation

To demonstrate the superiority of our method, a time-sync video dataset crawled from the Bilibili website is used for performance evaluation. We first give an overview of the dataset and the experiment settings before presenting the experimental results.

4.1 Dataset Overview

Bilibili is one of the most popular TSC video sharing websites in China and has led the trend of video interaction via TSCs. We collected video meta-information and the corresponding TSC data from the Bilibili website up to December 15th, 2018. Note that the TSC data we collected is only a part of the full history, because the platform periodically removes stale TSC data from the TSC pool and retains only the latest data. To make the experimental results more pronounced, we mainly focus on the gaming category, which attracts the most traffic on Bilibili. Under these premises, our dataset contains 57,294 users and 2,637 videos in total. These videos include 836,806 TSCs and 3,483 user-generated tags. More detailed statistics are presented in Table 2.

Table 2. Overall statistics of our time-sync comment dataset

4.2 Experiment Settings

Evaluation Metrics. To evaluate the performance of TSC video recommendation, we employ two widely adopted metrics, Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [11]. These two metrics measure the classification and the ranking performance of a recommender, respectively.
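For reference, with a single held-out item per user, HR@K and NDCG@K over a ranked candidate list can be computed as follows; this is the standard definition rather than an implementation detail specific to this paper.

```python
import math

def hr_at_k(ranked_videos, target, k=10):
    """HR@K: 1 if the held-out video appears in the top-K list, else 0."""
    return 1.0 if target in ranked_videos[:k] else 0.0

def ndcg_at_k(ranked_videos, target, k=10):
    """NDCG@K with a single relevant video: 1 / log2(position + 2)."""
    if target in ranked_videos[:k]:
        position = ranked_videos.index(target)  # 0-based rank
        return 1.0 / math.log2(position + 2)
    return 0.0
```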

Baselines. Besides the SACF method, we also implement two other algorithms for comparison, described as follows:

  • MLP is proposed by [7]. It is a pure collaborative filtering method that uses only the identities of the user and the item as embedding features. Previous work has shown its strong generalization ability, benefiting from the DNN model.

  • TCF is a variant of SACF. In contrast to SACF, TCF embeds user-generated video tag information instead of the TSC information used in SACF. It is a highly competitive baseline for video recommendation fused with conventional content features.

Parameter Settings. For better generality and comparability, we choose the widely used leave-one-out evaluation scheme [2, 7, 8], which holds out each user's latest interaction as a test case and uses the rest as the training set. All the algorithms are optimized with the cross-entropy loss defined in Eq. (8), where we randomly sample 4 negative instances for each positive instance. We use Adam [12] as the optimizer, and fix the batch size and learning rate at 256 and 0.001, respectively. Besides, for fairness, the number of hidden layers is set to 4 in all experiments.
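A minimal sketch of the leave-one-out split described above is shown below, assuming each interaction record carries a timestamp; the data layout is hypothetical.

```python
def leave_one_out_split(interactions):
    """Hold out each user's latest interaction as the test case;
    all earlier interactions form the training set."""
    by_user = {}
    for user, video, timestamp in interactions:
        by_user.setdefault(user, []).append((timestamp, video))
    train, test = [], {}
    for user, items in by_user.items():
        items.sort()                      # chronological order
        test[user] = items[-1][1]         # latest interaction -> test case
        train.extend((user, video) for _, video in items[:-1])
    return train, test
```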

4.3 Experiment Results

Experiment 1: Performance Comparison. Since the size of the last hidden layer reflects the learning capability of DNN models, we evaluate the performance with different factor sizes for a comprehensive comparison. Figure 3 shows that SACF outperforms the other two baselines with factor sizes of 8, 16, 32 and 64 on both metrics when the embedding size (ES) is set to 64, and the overall performance trend is MLP < TCF < SACF. Meanwhile, it is worth mentioning that all the curves decline to different degrees when the factor size becomes larger, indicating that a large factor size may cause overfitting and degrade the overall performance.

Fig. 3. Performance comparison with different sizes of latent factors.

Table 3 shows the precise results for different factor sizes. We notice that our proposed SACF algorithm exceeds MLP with a maximum performance improvement of 9.73% in HR@10 and 5.72% in NDCG@10. Compared with TCF, SACF also achieves improvements of 4.13% and 2.58%, respectively.

Table 3. Performance of HR@10 and NDCG@10 with different sizes of latent factors.

We also evaluate the performance of Top-K recommendation. We fix the embedding size (ES) to 64 and the latent factor (LF) size to 8. The results presented in Fig. 4 show that both HR@K and NDCG@K follow the same trend, i.e., MLP < TCF < SACF, where K ranges from 1 to 10. These findings further confirm that SACF has significant performance advantages over the baselines.

Fig. 4. Performance of Top-K recommendation, where K ranges from 1 to 10.

Table 4. Performance of HR@K and NDCG@K in Top-K recommendation.

The detailed performance of Top-K recommendation is presented in Table 4. As expected, all the performance indicators gradually improve as the value of K increases. At the same time, the performance gap between SACF and the other two baselines gradually widens.

Experiment 2: Impact of Embedding Size. The size of the embedding vector determines the feature description ability, especially when various kinds of input data are embedded. To this end, we further investigate the impact of the embedding size, summarized in Fig. 5, where the latent factor (LF) size is set to 8. The empirical evidence shows that the performance curves rise first and then stabilize. We speculate that increasing the embedding size can improve recommendation effectiveness to some extent, but as the dimension continues to rise, it also brings the risk of overfitting and harms the recommendation performance.

Fig. 5. Performance comparison with different sizes of the embedding vector.

Fig. 6. Recommendation performance over different numbers of iterations.

Experiment 3: Performance Changes with Iterations. As the number of iterations increases, the parameters of the neural network are updated many times, and the fitting goes from underfitting to overfitting. Figure 6 shows the recommendation performance of the algorithms at each iteration on our dataset. We can see that with more iterations, the performance first rises rapidly and then decreases gradually on both metrics, which indicates that too many iterations may lead to overfitting.

5 Conclusion

In this work, we proposed an efficient recommendation algorithm called SACF for crowdsourced time-sync videos, which exploits the characteristics of TSC data. By integrating semantic embeddings with the collaborative filtering paradigm, SACF achieves much better performance than competing algorithms on a real dataset. In the future, we will continue to explore this field. On the one hand, we can better model users' interests based on the rich emoji data in TSCs. On the other hand, we can also infer a user's mood in real time from the TSC that the user has just sent; such mood-aware data could further improve the performance of real-time recommender systems.