
1 Introduction

The past decade has witnessed a dramatic increase in online videos, including online TV episodes, online movies, user-generated content, livecasts, and so on. The traffic generated by world-leading online video websites (e.g., YouTube, Netflix, Tencent Video, Hulu) has come to dominate the Internet backbone. In daily life, users watch online videos for learning, news, and entertainment, and they tend to comment on a video after watching it. In recent years, a new type of video comment has emerged, called Time-Sync Comments (also known as Danmu or bullet-screen comments), which allows users to comment on video shots in real time. Time-sync comments fly across the screen, and people who are watching the same video also see the flying comments. To date, the time-sync comment service has been provided by quite a few online video websites, such as YouTube, Twitch, AcFun, BiliBili, NicoNico, and so on. In Fig. 1, we show an example of a video clip with time-sync comments.

Fig. 1. Example of a crowdsourced time-sync video.

Different from conventional video comments, time-sync comments are synchronized with a video’s playback time. This synchronization makes it possible for viewers watching the same video to share their watching experience and interact with each other. Latent features can be extracted from time-sync comments to provide more detailed information on user interests. For instance, viewers who wrote comments at nearby playback positions are likely to share some kind of similarity or association (e.g., liking or disliking a specific video shot). Intuitively, those viewers can be categorized into the same group with implicitly similar preferences. Moreover, a continuous bundle of time-sync comments can describe video content to some extent. Such information is useful for video recommendation and can further improve user experience.

In this paper, we propose a new video recommendation algorithm for crowdsourced time-sync videos, called SACF (Semantic-Aware Collaborative Filtering). The basic idea of SACF is to exploit the temporal relationship between time-sync comments and video frames, and to extract latent semantic representations of time-sync comments to provide more accurate video recommendation. Our proposed algorithm can model user preferences at the frame level. In summary, our main contributions are as follows:

  • We propose a novel video recommendation algorithm called SACF to improve recommendation performance for crowdsourced time-sync videos. Our algorithm extends traditional video recommendation algorithms by embedding latent semantic representations extracted from TSCs.

  • To better utilize interaction patterns, we integrate all the representations with a multi-layer perceptron (MLP) model. By embedding the extra semantic-aware information, our approach can readily identify users and items with similar interests and mitigate the cold-start problem.

  • We also validate our proposed algorithm on a real TSC dataset collected from the BiliBili video website. The experimental results show that our algorithm significantly outperforms the baselines, by up to 9.73% in HR@10 and 5.72% in NDCG@10.

The rest of the paper is organized as follows: we first review related work in Sect. 2 and describe the details of our algorithm in Sect. 3. The dataset and experiments are presented in Sect. 4. Finally, we conclude the paper and discuss future work in Sect. 5.

2 Related Work

Recommender systems have been extensively studied in the past years. He et al. [8], Koren et al. [13], and Mnih and Salakhutdinov [19] have shown the excellent ability of matrix factorization models for the rating prediction problem. In addition to this classic research question, Top-K recommendation using implicit feedback is also worthy of attention, with Hu et al. [9] and Rendle et al. [21] as representative works. Moreover, in recent years, neural network models [29] have gained widespread attention because of their ability to easily fit multi-dimensional features and learn nonlinear relationships, as demonstrated by Covington et al. [4] and He et al. [7].

With the development of this field, incorporating contextual information to improve user preference modeling has attracted major research interest, such as the work of Adomavicius and Tuzhilin [1] and Verbert et al. [25]. User reviews are among the most effective contextual information for modeling user preferences, and review-based algorithms are receiving more and more research attention. Tang et al. [23] incorporate user- and product-level information for document sentiment classification, while Tang et al. [24] leverage reviews for user modeling and predict the rating of a user review. In addition, many review-based models focus on enhancing the effectiveness of rating prediction, such as Ganu et al. [5] and McAuley and Leskovec [17]. To achieve better overall recommendation performance, Liu et al. [15], Tan et al. [22], and Wu and Ester [27] extract user opinions from review text and combine this information with conventional models for higher recommendation accuracy. Besides, Zhang et al. [30] integrate traditional matrix factorization with the word2vec model proposed by Mikolov et al. [18] for more precise modeling. Notably, these models mainly focus on the rating prediction problem and learn from traditional reviews, which are longer than time-sync comments and contain richer content and semantics.

Time-synchronized comments were first introduced by Wu et al. [26] for video shot tagging. Recently, as an emerging type of user-generated comment, TSCs have shown many practical properties for describing a video at the frame level. As mentioned by He et al. [6], the herding effect and multiple-burst phenomena make TSCs significantly different from traditional reviews, and that work also shows that TSCs are strongly correlated with video frame content and user reactions. Thus, this emerging comment type has been used for video highlight extraction and content annotation, as presented by Xian et al. [28] and Ikeda et al. [10]. Besides, an increasing number of models try to label videos based on TSCs, with representative work by Lv et al. [16]. Chen et al. [3] and Ping [20] further leverage TSCs to extract features of video frames and apply them to key frame recommendation. However, these methods focus on the video shot itself and do not consider the co-occurrence among TSCs.

Compared with the aforementioned approaches, we propose a semantic-aware collaborative filtering algorithm, which efficiently extracts latent representations from the TSCs within videos and achieves significant performance improvements in the Top-K recommendation task.

3 Design of Semantic-Aware Video Recommendation Algorithm

By fusing latent semantic representations from video TSCs with the traditional interaction paradigm into a composite entity, we propose our semantic-aware collaborative filtering (SACF) video recommendation algorithm, which gains flexibility and non-linearity by exploiting a multi-layer perceptron as its fundamental framework. We discuss the details of our algorithm in the following subsections.

3.1 Problem Definition

We first define the problem formally. Suppose there are N users \({{\varvec{u}}} = \{u_1,u_2,\ldots ,\) \(u_N\}\), M videos \({{\varvec{i}}} = \{i_1,i_2,\ldots ,i_M\}\) and T TSCs \({{\varvec{c}}} = \{c_1,c_2,\ldots ,c_T\}\). The TSC written by user u in video i at video time t is defined as a 2-tuple \({<}u_i^t, c_i^t{>}\). Thereby, for each video \(i \in {{\varvec{i}}}\), we can obtain two different sequences, namely the TSC writer sequence \({{\varvec{s}}}_{i}^{(u)} = (u_{i}^{1},u_{i}^{2},\ldots ,u_{i}^{T_i})\) and the TSC content sequence \({{\varvec{s}}}_{i}^{(c)} = (c_{i}^{1},c_{i}^{2},\ldots ,c_{i}^{T_i})\), where \(T_i\) is the number of TSCs in video i and each element within a sequence is ordered by its video time.

Suppose the representation of user u is defined as \({{\varvec{w}}}_u\), and the representation of video i is defined as \({{\varvec{d}}}_i\). The user semantic representations \({{\varvec{W}}}=\{ {{\varvec{w}}}_u | u \in {{\varvec{u}}} \}\) and video semantic representations \({{\varvec{D}}}=\{ {{\varvec{d}}}_i | i \in {{\varvec{i}}} \}\) are learned from the sequence data \({{\varvec{s}}}_{i}^{(u)}\) and \({{\varvec{s}}}_{i}^{(c)}\) using an improved word embedding technique.

We are given all the historical user-video interaction data \(\mathcal {D}=\{ \mathcal {O},\mathcal {O}^{-} \}\), where \(\mathcal {O}\) and \(\mathcal {O}^{-}\) denote the positive and negative instances, respectively. For a user u and a set of corresponding unseen videos, our goal is to find an interaction function \(f(\cdot )\) that ranks all the unseen videos according to how much s/he is likely to like each video. The top K videos most likely to be watched are the final results recommended to user u. For reference, we list the notations used throughout the algorithm in Table 1.

Table 1. Notations used in SACF algorithm
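To make the notation concrete, the following minimal Python sketch (the record fields and function names are our own illustration, not part of the algorithm) shows how the writer sequence \({{\varvec{s}}}_{i}^{(u)}\) and the content sequence \({{\varvec{s}}}_{i}^{(c)}\) could be assembled from raw TSC records ordered by video time.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TscRecord:
    """One time-sync comment: who wrote what, in which video, at which playback time."""
    user: str    # u
    video: str   # i
    time: float  # video playback time t (in seconds)
    text: str    # c

def build_sequences(records):
    """Group TSCs by video and order them by playback time, yielding the
    writer sequence s_i^(u) and the content sequence s_i^(c) per video."""
    by_video = defaultdict(list)
    for r in records:
        by_video[r.video].append(r)
    writer_seqs, content_seqs = {}, {}
    for video, recs in by_video.items():
        recs.sort(key=lambda r: r.time)
        writer_seqs[video] = [r.user for r in recs]
        content_seqs[video] = [r.text for r in recs]
    return writer_seqs, content_seqs
```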

3.2 Latent Semantic Representation

In this section, we explain the methods for capturing latent semantic representations, which aim to extract the latent similarity among users and among videos. Before digging into the details, to better model Internet slang, we need to conduct some data preprocessing:

  • Different character components within a TSC may carry different meanings. Thus, we split a complete TSC into multiple substrings, where each substring is a series of consecutive characters of the same type, and these substrings are treated as part of the TSC set \({{\varvec{c}}}\) as well. The character types are English letters, pure numbers, and Chinese characters. For instance, the TSC “awesome!!2333” is treated as two TSCs, i.e., “awesome” and “2333” (laughter).

  • Moreover, excessively long TSCs would impair the performance of the algorithm. Thereby, Chinese substrings that are too long are also segmented (the segmentation threshold in our experiments is more than five consecutive Chinese characters). A minimal preprocessing sketch is given after this list.
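The sketch below follows our reading of the two rules above; the regular expressions and the fixed-size chunking of long Chinese runs are illustrative assumptions (in practice a Chinese word segmenter could be used for the latter).

```python
import re

MAX_CHINESE_RUN = 5  # segmentation threshold from our experiments

def split_tsc(tsc):
    """Split a raw TSC into substrings of a single character type
    (English letters, digits, or Chinese characters)."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+", tsc)
    result = []
    for part in parts:
        # Overly long Chinese runs are further segmented; fixed-size chunking
        # is a placeholder here (a word segmenter could be used instead).
        if re.match(r"[\u4e00-\u9fff]", part) and len(part) > MAX_CHINESE_RUN:
            result.extend(part[i:i + MAX_CHINESE_RUN]
                          for i in range(0, len(part), MAX_CHINESE_RUN))
        else:
            result.append(part)
    return result

print(split_tsc("awesome!!2333"))  # -> ['awesome', '2333']
```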

Inspired by [14, 18], we propose improved word embedding methods to learn the representations. In this scheme, each user is mapped to a unique latent vector \({{\varvec{w}}}_{u}\), and each such vector is one column of the matrix \({{\varvec{W}}}\), the user semantic representation matrix. Given the user sequence \({{\varvec{s}}}_{i}^{(u)}\) over a finite user set \({{\varvec{u}}}\), the objective function to be maximized is formulated as follows,

$$\begin{aligned} \frac{1}{T_i} \sum _{t=k}^{T_i-k} \sum _{-k\le j\le k, j\ne 0} \log p(u_{i}^{t} | u_{i}^{t+j}) \end{aligned}$$
(1)

where k is the context window size and \(p(\cdot )\) is the softmax function,

$$\begin{aligned} p(u_{i}^{t} | u_{i}^{t+j}) = \frac{\exp ( {{{\varvec{w}}}_{u_{i}^{t}}}^T \cdot {{\varvec{w}}}_{u_{i}^{t+j}})}{\sum _{u^{'} \in {{\varvec{u}}}} \exp ( {{{\varvec{w}}}_{{u_{i}^{t+j}}}}^T \cdot {{\varvec{w}}}_{u^{'}} )} \end{aligned}$$
(2)

After training converges, users with similar watching patterns are projected to nearby positions in the vector space. We keep these user semantic representations \({{\varvec{W}}}=\{ {{\varvec{w}}}_u | u \in {{\varvec{u}}} \}\) for later use. Likewise, each TSC and each video are mapped to unique latent vectors \({{\varvec{w}}}_c\) and \({{\varvec{d}}}_i\), respectively. Each vector is a column of the corresponding matrix, \({{\varvec{W}}}_c\) or \({{\varvec{D}}}\), where \({{\varvec{D}}}\) denotes the video semantic representation matrix. We further leverage the information within \({{\varvec{W}}}\) and \({{\varvec{D}}}\) in the next subsection. Given the sequence \({{\varvec{s}}}_{i}^{(c)}\) over a finite TSC set \({{\varvec{c}}}\), the objective function is defined as,

$$\begin{aligned} \frac{1}{T_i} \sum _{t=k}^{T_i-k} \sum _{-k\le j\le k, j\ne 0} \log p(c_{i}^{t} | c_{i}^{t+j}) \end{aligned}$$
(3)

Similarly, the softmax function can be formulated as,

$$\begin{aligned} p(c_{i}^{t} | c_{i}^{t+j}) = \frac{\exp ( {{{\varvec{z}}}_{c_{i}^{t},i}}^T \cdot {{\varvec{z}}}_{c_{i}^{t+j}})}{\sum _{c^{'} \in {{\varvec{c}}}\ and\ i^{'} \in {{\varvec{i}}}} \exp ( {{{\varvec{z}}}_{{c_{i}^{t+j}},i}}^T \cdot {{\varvec{z}}}_{c^{'},i^{'}} )} \end{aligned}$$
(4)

where the latent vector \({{\varvec{z}}}\) is constructed from \({{\varvec{w}}}_c\) and \({{\varvec{d}}}_i\); in particular, it is the sum of \({{\varvec{w}}}_c\) and \({{\varvec{d}}}_i\). Our ultimate goal is to obtain the video semantic representation set \({{\varvec{D}}}=\{ {{\varvec{d}}}_i | i \in {{\varvec{i}}} \}\) for the next step.
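As a rough illustration of this pretraining stage, the sketch below uses off-the-shelf gensim models as stand-ins for the improved embeddings of Eqs. (1)–(4): a skip-gram Word2Vec over the writer sequences yields \({{\varvec{W}}}\), and a PV-DM Doc2Vec, whose document (video) vector is combined with the TSC vectors in the spirit of \({{\varvec{z}}} = {{\varvec{w}}}_c + {{\varvec{d}}}_i\), yields \({{\varvec{D}}}\). Hyperparameters and function names are assumptions, not the exact implementation.

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def learn_semantic_representations(writer_seqs, content_seqs, dim=64, window=5):
    """Learn user representations W from per-video writer sequences and
    video representations D from per-video TSC content sequences."""
    # Users: skip-gram over writer sequences, analogous to Eqs. (1)-(2).
    user_model = Word2Vec(sentences=list(writer_seqs.values()),
                          vector_size=dim, window=window, sg=1, min_count=1)
    # Videos: PV-DM, where the video tag vector is combined with the TSC
    # vectors, analogous to z = w_c + d_i in Eqs. (3)-(4).
    docs = [TaggedDocument(words=tscs, tags=[video])
            for video, tscs in content_seqs.items()]
    video_model = Doc2Vec(documents=docs, vector_size=dim,
                          window=window, dm=1, min_count=1)
    W = {u: user_model.wv[u] for u in user_model.wv.index_to_key}
    D = {video: video_model.dv[video] for video in content_seqs}
    return W, D
```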

3.3 Algorithm Design

We now elaborate the SACF algorithm in detail. Our approach employs a deep neural network, which is capable of learning non-linear interactions from the input data and can easily be embedded with extra features. Therefore, in addition to treating the identities of a user and a video, as in pure collaborative filtering, as the basic input features, SACF also transforms the video TSC corpus into two kinds of representations generated from the user and video latent semantic information. As shown in Fig. 2, on top of the input layer, each user and video is mapped to two corresponding vectors, i.e., an identification dense vector and a latent semantic representation vector. The generation of the latent semantic representations has been explained in Sect. 3.2. This 4-tuple of embedding vectors is then fed into a multi-layer neural network in which the layers are fully connected. The final output layer produces a predictive probability \(\hat{r}_{ui}\), which is trained by minimizing the binary cross-entropy loss between \(\hat{r}_{ui}\) and its target value \(r_{ui}\).

Fig. 2. SACF model structure for time-sync video recommendation.

Consequently, we can further formulate the SACF algorithm as,

$$\begin{aligned} \hat{r}_{ui} = f( {{\varvec{P}}}^T {{\varvec{v}}}_u, {{\varvec{Q}}}^T {{\varvec{v}}}_i, {{\varvec{W}}}^T {{\varvec{v}}}_u, {{\varvec{D}}}^T {{\varvec{v}}}_i \,|\, {{\varvec{P}}},{{\varvec{Q}}},{{\varvec{W}}},{{\varvec{D}}}, \varTheta ) \end{aligned}$$
(5)

where \({{\varvec{P}}}\) and \({{\varvec{Q}}}\) denote the latent factor matrices for users and videos, respectively. \({{\varvec{W}}}\) is the user semantic representation matrix, and analogously, \({{\varvec{D}}}\) is the video semantic representation matrix, which captures the latent features extracted from the TSC corpus. \({{\varvec{v}}}_u\) and \({{\varvec{v}}}_i\) denote the index (one-hot) vectors of user u and video i, respectively. \(\varTheta \) represents the parameters of the interaction function \(f(\cdot )\).

As mentioned above, the function \(f(\cdot )\) can be further defined as a multi-layer neural network,

$$\begin{aligned} f( {{\varvec{P}}}^T {{\varvec{v}}}_u, {{\varvec{Q}}}^T {{\varvec{v}}}_i, {{\varvec{W}}}^T {{\varvec{v}}}_u, {{\varvec{D}}}^T {{\varvec{v}}}_i ) = \sigma \big ( {{\varvec{h}}}^T \varPhi _L( \varPhi _{L-1}( \ldots \varPhi _1( {{\varvec{P}}}^T {{\varvec{v}}}_u, {{\varvec{Q}}}^T {{\varvec{v}}}_i, {{\varvec{W}}}^T {{\varvec{v}}}_u, {{\varvec{D}}}^T {{\varvec{v}}}_i ) \ldots ) ) \big ) \end{aligned}$$
(6)

where \(\sigma \) is the mapping function of the output layer and \(\varPhi _L\) denotes the \(L^{th}\) hidden layer of the neural network. More specifically, we formulate each layer as follows,

$$\begin{aligned} \begin{aligned} {{\varvec{x}}}_1&= \varPhi _1( {{\varvec{p}}}_u, {{\varvec{q}}}_i, {{\varvec{w}}}_u, {{\varvec{d}}}_i )= \begin{bmatrix} {{\varvec{p}}}_u&{{\varvec{q}}}_i&{{\varvec{w}}}_u&{{\varvec{d}}}_i \end{bmatrix}^{T}\\ {{\varvec{x}}}_2&= \varPhi _2({{\varvec{x}}}_{1}) = g_2({{\varvec{A}}}_2^T {{\varvec{x}}}_{1} + {{\varvec{b}}}_2)\\&\ldots \\ {{\varvec{x}}}_L&= \varPhi _L({{\varvec{x}}}_{L-1}) = g_L({{\varvec{A}}}_L^T {{\varvec{x}}}_{L-1} + {{\varvec{b}}}_L)\\ \hat{r}_{ui}&= \sigma ({{\varvec{h}}}^T {{\varvec{x}}}_L) \end{aligned} \end{aligned}$$
(7)

where \({{\varvec{A}}}_L\), \({{\varvec{b}}}_L\) and \(g_L\) respectively denote the weight matrix, bias vector and activation function of the \(L^{th}\) layer. \({{\varvec{p}}}_u\) and \({{\varvec{q}}}_i\) are the dense vectors obtained by embedding the one-hot encodings of user u and video i. \({{\varvec{w}}}_u\) and \({{\varvec{d}}}_i\) are the latent semantic representations generated from the crowdsourced TSC data. \({{\varvec{h}}}\) represents the weight vector of the output layer.
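For concreteness, a minimal PyTorch sketch of Eq. (7) is given below; the layer sizes, the ReLU activations, and the sigmoid output are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class SACF(nn.Module):
    """Sketch of Eq. (7): concatenate p_u, q_i, w_u, d_i, pass them through
    stacked fully connected layers, and output a watch probability."""
    def __init__(self, num_users, num_videos, factor_dim=8, sem_dim=64,
                 hidden_dims=(128, 64, 32, 16)):
        super().__init__()
        self.P = nn.Embedding(num_users, factor_dim)   # identification embeddings
        self.Q = nn.Embedding(num_videos, factor_dim)
        self.W = nn.Embedding(num_users, sem_dim)      # semantic representations,
        self.D = nn.Embedding(num_videos, sem_dim)     # initialized from Sect. 3.2
        layers, in_dim = [], 2 * factor_dim + 2 * sem_dim
        for out_dim in hidden_dims:                    # hidden layers Phi_2 ... Phi_L
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.mlp = nn.Sequential(*layers)
        self.h = nn.Linear(in_dim, 1)                  # output weights h

    def forward(self, users, videos):
        x1 = torch.cat([self.P(users), self.Q(videos),
                        self.W(users), self.D(videos)], dim=-1)  # Phi_1
        return torch.sigmoid(self.h(self.mlp(x1))).squeeze(-1)   # r_hat in (0, 1)
```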

To endow SACF with a probabilistic interpretation, we need to constrain \(\hat{r}_{ui}\) to the range [0, 1], which can be achieved by adopting a logistic or probit function as the activation of the output layer. We finally optimize SACF by minimizing the binary cross-entropy loss, so the objective function of SACF can be formulated as,

$$\begin{aligned} \mathcal {L}= -\sum _{(u,i) \in \mathcal {O} \cup \mathcal {O}^{-}} \left[ r_{ui} \log \hat{r}_{ui} + (1-r_{ui}) \log (1-\hat{r}_{ui}) \right] \end{aligned}$$
(8)

where \(\mathcal {O}\) denotes the set of observed interactions, and \(\mathcal {O}^{-}\) denotes the set of unobserved interactions. To improve the training efficiency, \(\mathcal {O}^{-}\) can be regarded as the negative instances sampled from all the unobserved interactions.
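A possible training step under Eq. (8) is sketched below, using the 4:1 negative-to-positive sampling ratio adopted later in Sect. 4.2; the sampling routine, variable names, and optimizer usage are our own assumptions.

```python
import random
import torch
import torch.nn.functional as F

def sample_negatives(watched, num_videos, k=4):
    """Draw k unobserved videos for one user as negative instances in O^-."""
    negatives = []
    while len(negatives) < k:
        v = random.randrange(num_videos)
        if v not in watched:
            negatives.append(v)
    return negatives

def train_step(model, optimizer, users, videos, labels):
    """One mini-batch update minimizing the binary cross-entropy of Eq. (8)."""
    optimizer.zero_grad()
    preds = model(users, videos)
    loss = F.binary_cross_entropy(preds, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```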

Algorithm 1. The SACF algorithm.

The semantic-aware neural collaborative filtering procedure is illustrated in Algorithm 1. The algorithm can be considered a two-stage process: in the first stage we pretrain the semantic representations of users and items, and in the second stage we combine them with the user-video interaction data to carry out the complete recommendation task. The algorithm works as long as a video has a series of TSC data. These data contain meta-information about users and videos, so even if a user rarely posts any TSC, the algorithm can still leverage the implicit information to estimate the probability of watching a video.

4 Performance Evaluation

To demonstrate the superiority of our method, a time-sync video dataset crawled from the Bilibili website is used for performance evaluation. We first give an overview of the dataset and the experiment settings before presenting the experimental results.

4.1 Dataset Overview

Bilibili is one of the most popular TSC video sharing websites in China and has led the trend of video interaction via TSCs. We collected video meta-information and the corresponding TSC data from the Bilibili website up to December 15th, 2018. Note that the TSC data we collected is only a part of the full history, because the platform periodically removes stale TSC data from the TSC pool and retains only the latest data. To make the experimental results more pronounced, we mainly focus on the gaming category, which attracts the most traffic on Bilibili. Under these premises, our dataset contains 57,294 users and 2,637 videos in total. These videos include 836,806 TSCs and 3,483 user-generated tags. More detailed statistics are presented in Table 2.

Table 2. Overall statistics of our time-sync comment dataset

4.2 Experiment Settings

Evaluation Metrics. To evaluate the performance of TSC video recommendation, we employ two widely adopted metrics, Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [11]. These two metrics measure the classification and the ranking performance of a recommender, respectively.
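For reference, with a single held-out item per user, HR@K and NDCG@K over a ranked candidate list can be computed as follows; this is the standard definition rather than an implementation detail specific to this paper.

```python
import math

def hr_at_k(ranked_videos, target, k=10):
    """HR@K: 1 if the held-out video appears in the top-K list, else 0."""
    return 1.0 if target in ranked_videos[:k] else 0.0

def ndcg_at_k(ranked_videos, target, k=10):
    """NDCG@K with a single relevant video: 1 / log2(position + 2)."""
    if target in ranked_videos[:k]:
        position = ranked_videos.index(target)  # 0-based rank
        return 1.0 / math.log2(position + 2)
    return 0.0
```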

Baselines. Besides the SACF method, we also implement two other algorithms for comparison, described as follows:

  • MLP is proposed by [7]. It is a pure collaborative filtering method that uses only the identities of the user and the item as embedding features. Previous work has shown its strong generalization ability, benefiting from the DNN model.

  • TCF is a variant of SACF. In contrast to SACF, TCF embeds user-generated video tag information instead of the TSC information used in SACF. It is a highly competitive baseline for video recommendation fused with conventional content features.

Parameter Settings. For better generality and comparability, we choose the widely used leave-one-out evaluation scheme [2, 7, 8], which holds out each user's latest interaction as a test case and uses the rest as the training set. All the algorithms are optimized with the cross-entropy loss defined in Eq. (8), where we randomly sample 4 negative instances for each positive instance. We use Adam [12] as the optimizer, and fix the batch size and learning rate at 256 and 0.001, respectively. Besides, for fairness, the number of hidden layers is set to 4 in all experiments.
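A minimal sketch of the leave-one-out split described above is shown below, assuming each interaction record carries a timestamp; the data layout is hypothetical.

```python
def leave_one_out_split(interactions):
    """Hold out each user's latest interaction as the test case;
    all earlier interactions form the training set."""
    by_user = {}
    for user, video, timestamp in interactions:
        by_user.setdefault(user, []).append((timestamp, video))
    train, test = [], {}
    for user, items in by_user.items():
        items.sort()                      # chronological order
        test[user] = items[-1][1]         # latest interaction -> test case
        train.extend((user, video) for _, video in items[:-1])
    return train, test
```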

4.3 Experiment Results

Experiment 1: Performance Comparison. Since the size of the last hidden layer reflects the learning capability of DNN models, we evaluate the performance with different factor sizes for a comprehensive comparison. Figure 3 shows that SACF outperforms the other two baselines with factor sizes of 8, 16, 32 and 64 on both metrics when the embedding size (ES) is set to 64, and the overall performance trend is MLP < TCF < SACF. Meanwhile, it is worth mentioning that all the curves decline to different degrees when the factor size becomes larger, indicating that a large factor size may cause overfitting and degrade the overall performance.

Fig. 3. Performance comparison with different sizes of latent factors.

Table 3 shows the precise results for different factor sizes. We notice that our proposed SACF algorithm exceeds MLP with a maximum performance improvement of 9.73% in HR@10 and 5.72% in NDCG@10. Compared with TCF, SACF also achieves improvements of 4.13% and 2.58%, respectively.

Table 3. Performance of HR@10 and NDCG@10 with different sizes of latent factors.

We also evaluate the performance of Top-K recommendation. We fix the embedding size (ES) to 64 and the latent factor (LF) size to 8. The results presented in Fig. 4 show that both HR@K and NDCG@K follow the same trend, i.e., MLP < TCF < SACF, where K ranges from 1 to 10. These findings further confirm that SACF has significant performance advantages over the baselines.

Fig. 4. Performance of Top-K recommendation, where K ranges from 1 to 10.

Table 4. Performance of HR@K and NDCG@K in Top-K recommendation.

The detailed performance of Top-K recommendation is presented in Table 4. As expected, all the performance indicators gradually improve as the value of K increases. At the same time, the performance gap between SACF and the other two baselines gradually widens.

Experiment 2: Impact of Embedding Size. The size of the embedding vector determines the feature description ability, especially when various kinds of input data are embedded. To this end, we further investigate the impact of the embedding size, summarized in Fig. 5, where the latent factor (LF) size is set to 8. The empirical evidence shows that the performance curves rise first and then stabilize. We speculate that increasing the embedding size can improve recommendation effectiveness to some extent, but as the dimension continues to rise, it also brings the risk of overfitting and harms the recommendation performance.

Fig. 5. Performance comparison with different sizes of the embedding vector.

Fig. 6. Recommendation performance over different numbers of iterations.

Experiment 3: Performance Changes with Iterations. As the number of iterations increases, the parameters of the neural network are updated many times, and the fitting goes from underfitting to overfitting. Figure 6 shows the recommendation performance of the algorithms at each iteration on our dataset. We can see that with more iterations, the performance first rises rapidly and then decreases gradually on both metrics, which indicates that too many iterations may lead to overfitting.

5 Conclusion

In this work, we proposed an efficient recommendation algorithm called SACF for crowdsourced time-sync videos, which exploits the characteristics of TSC data. By integrating semantic embeddings with the collaborative filtering paradigm, SACF achieves much better performance than competing algorithms on a real dataset. In the future, we will continue to explore this field. On the one hand, we can better model users' interests based on the rich emoji data in TSCs. On the other hand, we can also infer a user's mood in real time from the TSC that the user has just sent; such mood-aware data could further improve the performance of real-time recommender systems.