Discriminative and efficient non-local attention network for League of Legends highlight detection

With the growing popularity of eSports, video highlight detection, which encapsulates the most informative parts of a video in a few seconds, has become a critical part of live competition. However, learning the spatial-temporal dependency efficiently and discriminatively in video highlight detection for League of Legends (LoL) is a key problem. In this study, to address this problem, we propose a novel discriminative and efficient non-local attention network (DENAN) for LoL highlight detection. In particular, both spatial and temporal dependencies are learned using an end-to-end, lightweight, trainable framework. An auxiliary triplet loss is used in discriminative training to learn robust LoL video feature representations and improve DENAN's performance. Our experimental results on the NALCS and LMS datasets show the effectiveness of our method in terms of performance and computation cost.


Introduction
Recently, eSports has become increasingly popular among players and viewers. League of Legends (LoL), one of the most famous online eSports, has attracted numerous players and fans around the world. According to a Statista report, the LoL World Championships peaked at approximately 46 million concurrent viewers in 2020. Highlight replay is an important part of live streaming because it shows the most exciting fights of the stream. However, current LoL highlight replay is largely manual. Therefore, it is critical to implement automatic and efficient highlight detection in LoL.
As an attempt to automate highlight generation, video highlight detection has attracted the attention of both academia and industry. The goal of highlight detection is to generate a short video clip from the candidate video that captures a user's primary attention or interest [1]. Existing video highlight detection methods can be divided mainly into two categories: structure-driven [2,3] and keyframe-based methods [4-6]. Researchers using structure-driven methods have focused on a well-defined data structure in the detected video, such as audience cheering or chatting, score changes, or other special events. The main goal of keyframe-based methods is to optimize the feature representation from the frame level to the clip level. Keyframes are extracted using the histogram of oriented gradients (HOG), scale-invariant feature transform (SIFT), and clustering algorithms, and then video browsing is performed near the selected keyframes, smoothing out video highlights. With the recent advancements in convolutional neural networks (CNNs), CNN-based methods have been extensively used in video highlight detection [4,6-8].
To learn a robust feature representation for video highlight detection, some studies have considered the temporal dependency between video frames. One study [6] used long short-term memory (LSTM) [9] to model the variable-range temporal dependency among video frames to generate both representative and compact video summaries. A two-layer recurrent neural network (RNN) was used to construct a hierarchical RNN, which exploits the long temporal dependency among frames [10]. However, learning discriminative and robust video feature representations solely based on temporal dependency is insufficient, and it is natural to model the spatial-temporal dependency inside the video. In a previous study, the features extracted from the spatial and temporal streams were combined to develop a novel pairwise deep ranking model [7]. Another study proposed a deep ranking model to produce a score map for each video segment based on the spatial and temporal streams. For the task of video highlight detection in LoL, along the time dimension, we intend to determine when the audience is most interested in the live stream. Additionally, it is important to determine the position to which the audience will pay more attention. Therefore, explicitly learning the spatial-temporal dependency for the LoL highlight detection task is appealing.
Fig. 1 LoL videos are parsed into a sequence of frames and then fed into DENAN. Each frame is classified as 1 or 0, corresponding to highlight or non-highlight. Finally, consecutive frames classified as highlights are combined into the LoL video highlights
A few attempts have recently been made to integrate attention-based methods into video highlight detection, with the goal of weighting the importance of different frames. The attention mechanism learned with Bi-LSTM was used to model the importance of different source frames for video highlight detection [11]. A self-attention mechanism was used to replace complex LSTM, which captures the temporal relationship between input frame features before computing the weighted average of all input features [12]. However, while promising performances have been reported, most current attention-based methods only use complex LSTM to learn the temporal dependency. Therefore, a model for capturing the spatial-temporal dependency with a low computational cost is required for LoL highlight detection.
In this study, we attempt to capture the spatial-temporal feature representation in the candidate video with an attention mechanism for an efficient and discriminative end-to-end trainable LoL video highlight detection task. Based on another study [13], we propose a discriminative and efficient non-local attention network (DENAN), as shown in Fig. 1, which incorporates the non-local attention mechanism into ShuffleNetV2 [14], a lightweight CNN for classification. The non-local attention module refines the LoL video sequence representations by generating mean-weighted attention to the features of different spatial and temporal locations in the sequences. DENAN explores the spatial-temporal diversity of LoL video sequences, and discriminatively and efficiently learns the sequence representation. The main contributions of this study are as follows:
1. We propose an end-to-end framework for learning both spatial and temporal dependencies in a discriminative and efficient non-local attention network for LoL video highlight detection.
2. We significantly reduce the computation cost of the LoL video highlight detection task while improving performance.
3. Experimental results on the NALCS and LMS datasets show that our proposed method outperforms existing methods in terms of efficiency and accuracy.

Related work
In this section, we focus on recent methods related to our work, which include video highlight detection, eSports video highlight detection, and attention-based video highlight detection.

Video highlight detection
Research on video highlight detection is mainly performed along two directions: (a) keyframe-based methods and (b) structure-driven methods. Keyframe-based methods use a subset of representative keyframes from the original video to generate highlights. Most early video highlight detection methods focus on extracting keyframes independently and treat the problem as a classification task. Borth et al. proposed a keyframe extraction approach in which the video is first segmented into shots using shot boundary detection, and keyframes are then obtained using the k-means algorithm [15]. Lin et al. used a context-specific highlight support vector machine (SVM) model to summarize video sequences without watching the entire video by predicting the contextual information of each video segment [16]. Even though these methods achieved remarkable performance, they only extract low-level features and ignore the temporal dependency, which describes the relationship between highlight and non-highlight frames. Unlike keyframe-based methods, structure-driven methods exploit a well-defined data structure in the detected video, such as audience cheering and chatting, score changes, or other special events. Therefore, structure-driven methods are suitable for sports video highlight detection, and they have attracted the attention of many researchers [6,17]. Zhao et al. proposed a highlight detection model based on audio energy and motion activity. Hsieh et al. proposed a more flexible solution for finding important and meaningful events in sports games by analyzing the messages shared between users on microblog services [18]. While sports video highlight detection has improved, these methods rely on audio, textual, and psychological data, which are not always easy to obtain.

eSports video highlight detection
With the recent rapid development of eSports, video highlight detection has attracted the attention of both industry and academia. Fu et al. proposed a CNN-LSTM model for LoL that combines visual features and real-world audience discourse. Song et al. proposed a cascaded prediction approach for learning convolution filters of visual effects for detecting video highlights in Heroes of the Storm, LoL, and Dota2 [19]. Wang et al. [20] proposed a multi-stream framework to fuse spatial, temporal, and audio features extracted from Honor of Kings videos.
Recent eSports video highlight detection methods have attempted to address the problem from a cross-modal perspective. However, for real-world applications, cross-modal information, such as audience chat or audio signals, is either difficult to obtain or requires additional computation cost. Therefore, we consider eSports video highlight detection using only visual features.

Attention-based video highlight detection
The objective of attention-based video highlight detection is to find what the user is paying the most attention to, which is highly correlated with highlights. Ma et al. presented a generic framework in which computational attention models, based on modeling the viewer's attention, estimate how much attention viewers may pay to video content [21]. Ejaz et al. proposed an efficient visual attention model based on the keyframe extraction method and reduced the computational cost using the temporal gradient based on dynamic visual saliency detection [22]. Additionally, some studies integrated spatial-temporal clues into attention-based video highlight detection, with the goal of determining when and where users are most interested. A novel three-dimensional attention model was proposed, which can automatically localize the key elements in a video without any extra supervised annotations [23].
A self-attention mechanism was proposed to model the long-range dependency in machine translation [24]. Inspired by the self-attention mechanism [24] and the non-local means algorithm [25], Wang et al. proposed non-local attention [13], which computes the response at one position as a weighted sum of the features at all positions, capturing long-range spatial-temporal dependency for video representation. Our work is similar to [23], but [23] computes coarse-grained spatial-temporal attention in which the attention matrix only models the relationship between different channels and is shared along the spatial and temporal dimensions. Conversely, DENAN can capture a fine-grained relationship at the pixel level, capturing the relationship between all positions along the spatial and temporal dimensions.

Methods
In this section, we first give an overview of our proposed DENAN and then describe each sub-module in detail. Figure 2 shows an overview of our proposed network, which consists of three parts: (1) a LoL video encoder, which converts LoL video sequences into deep feature representations, with ShuffleNetV2 [14] used as the encoder; (2) a non-local attention module, which captures both spatial-temporal and long-range dependencies in LoL video; and (3) the loss functions, where triplet loss (TL) [26] and cross-entropy (CE) loss are used to optimize the entire framework.

Video encoder for LoL
We adopt an efficient video encoder with low computation costs to meet the requirement for a real-time video highlight detection system. Inspired by the recent studies [14,27], which designed an efficient and light-weight 2D universal CNN, we adopt ShuffleNetV2 [14] (Fig. 3) as a frame-level video encoder for LoL.
The number of feature channels in a lightweight network is limited by the computing resources available. Compared with ShuffleNetV1 [27], the main improvement of ShuffleNetV2 [14] is the channel split operation. In particular, the channels of the input feature map are divided into one branch with C′ channels and another with C − C′ channels (C is the total number of channels). After a three-layer convolution operation for each branch, the features of the two branches are concatenated. Then, a channel shuffle operation is applied so that information can flow between the two branches.
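The channel split and channel shuffle steps can be illustrated with a minimal NumPy sketch. It is not the ShuffleNetV2 implementation: the convolutional branches are replaced by hypothetical identity placeholders (`left_fn`, `right_fn`), the feature map is assumed to have layout (C, H, W), and the function names are chosen here for illustration only.

```python
import numpy as np

def channel_split(x, c_prime):
    """Split a (C, H, W) feature map into branches with C' and C - C' channels."""
    return x[:c_prime], x[c_prime:]

def channel_shuffle(x, groups=2):
    """Interleave the channels of `groups` equal-sized channel groups."""
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # (groups, C/groups, H, W) -> (C/groups, groups, H, W) -> (C, H, W)
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def shuffle_unit(x, c_prime, left_fn, right_fn):
    """Split, transform each branch, concatenate, then shuffle."""
    left, right = channel_split(x, c_prime)
    merged = np.concatenate([left_fn(left), right_fn(right)], axis=0)
    return channel_shuffle(merged, groups=2)

# Tag each of 8 channels with its index so the permutation is visible.
x = np.arange(8, dtype=np.float32).reshape(8, 1, 1)
identity = lambda t: t  # hypothetical stand-in for a convolutional branch
y = shuffle_unit(x, c_prime=4, left_fn=identity, right_fn=identity)
channel_order = y[:, 0, 0].astype(int).tolist()
print(channel_order)  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, channels from the two branches alternate, so the next unit's split hands each branch a mix of features from both previous branches.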
Let $\{X_i\}_{i=1}^{N}$ denote a set of video sequences, where $X_i$ denotes one LoL video sequence and $N$ represents the total number of video sequences. Each sequence contains $T$ frames, where $T$ is the video sequence length. The frames of one sequence are input to the video encoder one by one to obtain a frame-level feature sequence $\chi_i = \{x_i^t\}_{t=1}^{T}$, where $x_i^t$ represents the feature map of the $t$-th image in the $i$-th sequence. Then, the output from the video encoder is fed into the non-local attention module to further capture the spatial-temporal features of LoL video.

Spatial-temporal feature extraction using non-local attention module
The input of the non-local attention module is the frame-level feature sequence $\chi_i$. Following the non-local form proposed in [13,25], the non-local operation can be defined as follows:
$$F_{i,j} = \frac{1}{N(\chi_i)} \sum_{\forall k} s(\chi_{i,j}, \chi_{i,k})\, g(\chi_{i,k}) \qquad (1)$$
where $\chi_i$ represents the extracted feature sequence, $F_{i,j}$ is the response at position $j$, $j$ is the position index of the computed response, $k$ enumerates all possible positions of the input feature sequence in both spatial and temporal dimensions, $s(\chi_{i,j}, \chi_{i,k})$ denotes the relationship between position $j$ and position $k$ of the input feature sequence, $g(\chi_{i,k})$ computes the feature representation of the input $\chi_i$ at position $k$, and $N(\chi_i)$ is the normalization factor. We design a non-local attention module based on the non-local operation [13] to capture spatial and temporal pixel dependencies in LoL video (Fig. 4). The non-local attention follows the form of Eq. (1) and captures the spatial and temporal long-range pixel relationship. It is defined as follows:
$$\tilde{F}_{i,j} = \sum_{\forall k} \frac{\exp\left(\theta(\chi_{i,j})^{\top} \phi(\chi_{i,k})\right)}{\sum_{\forall k'} \exp\left(\theta(\chi_{i,j})^{\top} \phi(\chi_{i,k'})\right)}\, g(\chi_{i,k}) \qquad (2)$$
where $\tilde{F}_{i,j}$ represents the element of $\tilde{F}_i$ at position $j$, and $k$ enumerates all possible positions of $\chi_i$. Here, we project $\chi_{i,j}$ and $\chi_{i,k}$ into an embedding space using linear transformations. Therefore, $\theta(\chi_{i,j}) = W_\theta \chi_{i,j}$, $\phi(\chi_{i,k}) = W_\phi \chi_{i,k}$, and $g(\chi_{i,k}) = W_g \chi_{i,k}$, where $W_\theta$, $W_\phi$, and $W_g$ are weights to be learned. Equation (2) is similar to the self-attention mechanism proposed in [24].
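As an illustration of the non-local attention described above, the following NumPy sketch flattens all spatial-temporal positions of one feature sequence and computes each response as a softmax-weighted sum over every position. Several details are assumptions made for the sketch rather than facts from the paper: the softmax (embedded Gaussian) normalization of [13], the channels-last layout (T, H, W, C), the embedding width `C_EMB`, and the random weight matrices standing in for learned parameters.

```python
import numpy as np

def non_local_attention(chi, w_theta, w_phi, w_g):
    """chi: (T, H, W, C) feature sequence. Returns (F_tilde, attention weights)."""
    t, h, w, c = chi.shape
    flat = chi.reshape(t * h * w, c)      # one row per spatial-temporal position
    theta = flat @ w_theta                # query embedding, (P, C_emb)
    phi = flat @ w_phi                    # key embedding,   (P, C_emb)
    g = flat @ w_g                        # value embedding, (P, C_emb)
    scores = theta @ phi.T                # pairwise relationship between positions j and k
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # normalize over k
    f_tilde = weights @ g                 # weighted sum over all positions
    return f_tilde.reshape(t, h, w, -1), weights

rng = np.random.default_rng(0)
T, H, W, C, C_EMB = 4, 3, 3, 8, 4         # toy sizes, assumed for illustration
chi = rng.standard_normal((T, H, W, C))
w_theta, w_phi, w_g = (rng.standard_normal((C, C_EMB)) * 0.1 for _ in range(3))
f_tilde, attn = non_local_attention(chi, w_theta, w_phi, w_g)
print(f_tilde.shape, attn.shape)          # (4, 3, 3, 4) (36, 36)
```

The attention matrix has one row and one column per position, so its size grows quadratically with T × H × W, which is one reason the sequence length is bounded by memory.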
To project $\tilde{F}_i$ into the original space, the non-local operation is wrapped into the non-local attention module, as shown in Fig. 4, which is defined as follows:
$$Z_i = W_z \tilde{F}_i + \chi_i \qquad (3)$$
where $W_z$ represents a linear projection matrix, which is implemented using a 1 × 1 × 1 convolution, and $Z_i$ represents the final output of the non-local attention module. A video highlight is a span of consecutive frames that are not independent in time. Moreover, determining whether a particular frame is a highlight must consider the long-range pixel dependency in the spatial dimension. Therefore, we should consider the relationships within LoL videos along both spatial and temporal dimensions.
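To further refine the spatial-temporal information of the learned features, we use global average pooling (GAP) to aggregate information in spatial and temporal dimensions as follows:
$$\tilde{z}_i = \frac{1}{T H W} \sum_{t=1}^{T} \sum_{h=1}^{H} \sum_{w=1}^{W} Z_i^{(t,h,w)} \qquad (4)$$
where $\tilde{z}_i$ is the pooled feature of the $i$-th sequence, and $T$, $H$, and $W$ are the temporal length, height, and width of $Z_i$, respectively.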
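The output projection and the pooling step can be sketched in a few lines of NumPy. The sketch assumes a channels-last layout (T, H, W, C), treats the 1 × 1 × 1 convolution as a per-position matrix product, and adds the input back as a residual connection, following the non-local block of [13]; the residual term and the toy shapes are assumptions, and `project_and_pool` is a name introduced here.

```python
import numpy as np

def project_and_pool(f_tilde, chi, w_z):
    """Project attention output back to C channels, add the input, then apply GAP."""
    z = f_tilde @ w_z + chi               # 1x1x1 conv == per-position linear map (+ assumed residual)
    z_pooled = z.mean(axis=(0, 1, 2))     # average over time, height, and width
    return z, z_pooled

rng = np.random.default_rng(0)
chi = rng.standard_normal((4, 3, 3, 8))       # encoder features, toy sizes
f_tilde = rng.standard_normal((4, 3, 3, 4))   # stand-in for the attention output
w_z = rng.standard_normal((4, 8)) * 0.1       # stand-in for the learned projection
z, z_pooled = project_and_pool(f_tilde, chi, w_z)
print(z.shape, z_pooled.shape)                # (4, 3, 3, 8) (8,)
```

The pooled vector has one entry per channel and serves as the sequence-level feature passed to the loss functions.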

Loss function for training DENAN
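To extract discriminative and robust features from DENAN, we combine CE loss with TL [26] to train our framework for LoL video highlight detection. The CE loss is defined as follows:
$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log f(\tilde{z}_i), \qquad f(\tilde{z}_i) = \frac{\exp\left(W_{y_i}^{\top} \tilde{z}_i\right)}{\sum_{j} \exp\left(W_{j}^{\top} \tilde{z}_i\right)} \qquad (5)$$
where $f(\cdot)$ denotes softmax, $W_{y_i}$ represents the classifier weight for the ground-truth class $y_i$, the output $f(\tilde{z}_i)$ denotes the probability of whether the input frame is a highlight, $N$ is the number of LoL video sequences in a mini-batch, and $j$ ranges over all classes, including highlight and non-highlight.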
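Meanwhile, the discriminativeness of the features extracted from DENAN is improved using TL [26]. The batch-hard TL is defined as follows:
$$L_{TL} = \frac{1}{N} \sum_{a=1}^{N} \left[ \alpha + \max_{p} \left\| \tilde{z}_a - \tilde{z}_p \right\|_2 - \min_{n} \left\| \tilde{z}_a - \tilde{z}_n \right\|_2 \right]_{+} \qquad (6)$$
where $\tilde{z}_a$, $\tilde{z}_p$, and $\tilde{z}_n$ are features extracted from the anchor, positive, and negative samples, respectively, and $\alpha$ is the margin hyper-parameter that controls the difference between intra- and inter-class distances. Here, positive and negative samples refer to LoL video sequences with the same class as the anchor and a different class from the anchor, respectively. In summary, the loss function for DENAN training is a combination of TL and CE losses, which is defined as follows:
$$L = L_{CE} + \lambda L_{TL} \qquad (7)$$
where $\lambda$ controls the balance between TL and CE loss; $\lambda$ is 1 in this work.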
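The combined objective can be sketched in NumPy as follows. This is an illustrative reading of the two losses, not the authors' code: the batch-hard mining follows the usual formulation (hardest positive and hardest negative per anchor, Euclidean distance), averaging over anchors is an assumption, and the toy features, labels, and function names are invented for the example.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a mini-batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def batch_hard_triplet(features, labels, margin=0.1):
    """For each anchor use the farthest positive and the closest negative in the batch."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))               # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(margin + hardest_pos - hardest_neg, 0.0).mean())

def total_loss(logits, features, labels, margin=0.1, lam=1.0):
    """CE loss plus lambda-weighted triplet loss (lambda = 1 in the paper)."""
    return cross_entropy(logits, labels) + lam * batch_hard_triplet(features, labels, margin)

# Toy batch: two highlight (1) and two non-highlight (0) sequences on a line.
features = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
labels = np.array([1, 1, 0, 0])
logits = np.zeros((4, 2))                                  # uninformative classifier
print(batch_hard_triplet(features, labels))                # approximately 1.1
print(cross_entropy(logits, labels))                       # log(2), approximately 0.693
```

In the toy batch the classes are interleaved, so every anchor has a negative closer than its farthest positive and each contributes 0.1 + 2 − 1 = 1.1 to the triplet loss.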

Datasets and evaluation metrics
We trained and evaluated our proposed DENAN on two datasets: NALCS [28] and LMS [28]. NALCS contains 218 LoL videos from the 2017 Spring season, with 128 videos in the training set, 40 videos in the validation set, and 50 videos in the test set. Each video is between 30 and 50 min long on average and contains both highlight and non-highlight frames. The data labeling process is described in detail in [28]. The LMS dataset contains 103 LoL videos, including 57 training videos, 18 validation videos, and 28 testing videos. In our experiments, the training and validation sets of these two datasets were used for training, while the test sets were used for testing.
Based on the commonly used metrics in video summarization tasks [6,8,29], we use precision (P), recall (R), and F1-score (F1) as evaluation metrics in our experiments to evaluate the performance of DENAN. Let TP denote the number of highlight frames that are correctly predicted as highlights, FP the number of non-highlight frames that are incorrectly predicted as highlights, and FN the number of highlight frames that are incorrectly predicted as non-highlights. Then, P, R, and F1 are calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (8)$$
To evaluate the computation complexity of DENAN, we also report floating point operations (FLOPs) and the number of parameters (Num Params) [14,30]. In particular, FLOPs represents the number of floating point operations the model performs when processing a sequence of data, while Num Params represents the number of DENAN parameters.
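The three frame-level metrics can be computed with a short, dependency-free Python function. The label convention (1 for highlight, 0 for non-highlight), the function name, and the toy label lists are assumptions made for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Frame-level precision, recall, and F1 with 1 = highlight, 0 = non-highlight."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # guard against empty denominators
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, f1)  # 0.75 0.75 0.75
```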

Implementation details
In LoL, attractive actions, such as escapes and kills, occur in the later part of a highlight fragment. Because no one can predict what will happen until the last moment, the later part of the highlight fragment is highly related to the final result. To generate a proper data format, we sampled 5000 positive frames from the last labeled positive frames and treated them as the real positives for training on the NALCS and LMS datasets. Additionally, another 5000 negative frames were sampled from all negative frames. Each sampled frame was taken as the first frame of a video sequence, and the remaining frames of the sequence were sampled every ten frames from the 720P, 30 FPS video. During testing, one frame out of every 30 frames of the 720P, 30 FPS video was evaluated.
Our proposed DENAN uses ImageNet-pretrained ShuffleNetV2 [14] as a backbone. The last two layers of ShuffleNetV2 (GAP and FC layers) are removed. All the frames in LoL videos are resized to 224 × 224. There are 32 LoL video sequences in each mini-batch, each of which contains 16 frames, resulting in 512 frames. During the training stage, the initial learning rate is set to 0.01 and is decreased to 0.001 at the 20th epoch. We set the maximum number of epochs to 60, which is sufficient to reach convergence. The SGD algorithm is used for optimizing the parameters. The momentum and weight decay for SGD are set to 0.9 and $10^{-4}$, respectively.
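The optimization settings can be summarized in a small Python sketch. The update rule follows the momentum and weight-decay formulation commonly used by deep learning libraries, which is an assumption about the implementation; whether the 20th epoch is counted from 0 or 1 is also assumed here, and both function names are illustrative.

```python
def learning_rate(epoch, base_lr=0.01, drop_epoch=20, factor=0.1):
    """Step schedule: 0.01 before `drop_epoch`, 0.001 from then on (0-based epochs assumed)."""
    return base_lr * factor if epoch >= drop_epoch else base_lr

def sgd_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update for a scalar parameter with momentum and L2 weight decay."""
    grad = grad + weight_decay * weight        # weight decay added to the gradient
    velocity = momentum * velocity + grad      # momentum buffer
    return weight - lr * velocity, velocity

schedule = [learning_rate(e) for e in (0, 19, 20, 59)]
w, v = sgd_step(weight=1.0, grad=0.5, velocity=0.0, lr=learning_rate(0))
print(schedule)
print(w, v)
```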

Discussion of the experimental results
In this section, we conduct a series of experiments on the NALCS and LMS datasets to demonstrate the validity of all components in our proposed DENAN. Additionally, we investigate the effect of the TL margin and the length of the video sequence on model performance.
Table 1 shows the results of an ablation study conducted for each component in DENAN. The ShuffleNetV2 network trained with CE loss on the NALCS and LMS datasets serves as the baseline. NAN stands for non-local attention network and TL denotes triplet loss. Compared with the baseline, baseline + NAN changes P, R, and F1 by 0.04, −0.02, and 0.01 on the NALCS dataset and by 0.07, −0.04, and 0.01 on the LMS dataset, respectively. Baseline + TL means that TL is added to the training of the baseline, which improves P, R, and F1 by 0.01, 0.01, and 0.01 on the NALCS dataset and by 0.02, 0.01, and 0.01 on the LMS dataset, respectively. The objective of TL is to improve the discriminativeness of the features extracted by the network.
In Fig. 5, we conduct experiments on both the NALCS and LMS datasets with baseline + NAN + TL to evaluate the performance in terms of the TL margin, which controls the minimum distance between the hardest positive and the hardest negative. In these experiments, our framework achieves the best result on the NALCS and LMS datasets when α = 0.1. Smaller values of the TL margin yield promising results because TL focuses directly on the feature representation and can enhance the robustness of a backbone trained only with CE loss. When the TL margin is set to a larger value, DENAN overfits.
Table 2 shows the experimental results on the NALCS and LMS datasets for an ablation study based on the length of the input LoL video sequences. For a fair comparison, we use the model trained with a video sequence length of 16 and evaluate the performance for different lengths (4, 8, and 16). Compared with T = 4, when T = 16, P, R, and F1 increased by 0.11, 0.04, and 0.08, respectively, on the NALCS dataset, whereas on the LMS dataset, P, R, and F1 increased by 0.08, 0.08, and 0.08, respectively. This is because our proposed DENAN captures the spatial-temporal dependency between different frames. In theory, as the length of the video sequence grows, the performance of the model should increase. However, the length is limited by the GPU memory, and we can only set a maximum length of 16.

Comparison with state-of-the-art methods
As shown in Table 3, our approach outperforms the state-of-the-art methods on most of the evaluation metrics (P, R, F1, FLOPs, and Num Params) on the NALCS and LMS datasets. The P, R, and F1 of the proposed DENAN are 0.77, 0.74, and 0.76, respectively, on the NALCS dataset and 0.73, 0.78, and 0.76, respectively, on the LMS dataset. In particular, the Num Params and FLOPs of the proposed DENAN are 3.35 M and 246.77 M, respectively.
DR-DSN [8] formulates video summarization as a sequential decision-making process and develops a deep summarization network to indicate the likelihood of a frame being selected to summarize the video. According to the probability distributions, selected frames are used as video highlights. Our method is 0.06 higher for F1 on the NALCS dataset, and 0.08 higher on the LMS dataset. In terms of Num Params and FLOPs, DENAN outperformed DR-DSN. However, DR-DSN first extracts frame-level features using GoogLeNet [31], and the extracted features are always local. Therefore, our model still has a computational complexity advantage.
lv-LSTM [28] is proposed for video highlight prediction in LoL based on joint visual features and textual analysis of the audience commentary. It achieves P, R, and F1 of 0.79, 0.7, and 0.75 on the NALCS dataset and 0.72, 0.68, and 0.7 on the LMS dataset, respectively. Relative to lv-LSTM, our proposed method changes P, R, and F1 by −0.02, 0.04, and 0.01 on the NALCS dataset and by 0.01, 0.1, and 0.06 on the LMS dataset, respectively. DENAN achieves a higher F1 than lv-LSTM with only visual features, which demonstrates the effectiveness of DENAN for LoL video highlight detection.

Visualization of DENAN Performance
We compare the LoL video highlights made by Tencent with those detected by our proposed DENAN. Figure 6a, c show the LoL video highlight replay made by Tencent. Figure 6b, d show the LoL video highlight detection made by our proposed DENAN. Highlight frames and non-highlight frames are marked with green and red blocks, respectively. Most highlight frames are detected correctly, and our proposed DENAN outperforms state-of-the-art methods, including Tencent's current methods. Our proposed DENAN not only achieves accurate video highlight detection (higher P, R, and F1), but it also has lower FLOPs and Num Params, enabling automatic and real-time video highlight detection.
Figure 7 shows the visualization of our proposed method on LoL videos from the NALCS and LMS datasets. We denote video highlight and non-highlight frames as 1 and −1, respectively. The upper part of Fig. 7a, b shows a comparison of ground truth (label) and prediction (predict), where the turquoise line represents the ground truth of each frame in the video, and the light-pink line represents the prediction of our proposed DENAN. The lower part shows the overlap between ground truth and prediction, with green indicating correct prediction and red indicating the opposite. As shown in Fig. 7, most positive labels are predicted correctly, which demonstrates the effectiveness of our method.
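Turning per-frame predictions into highlight segments, as in the visualizations, amounts to grouping runs of consecutive positive frames. The following Python sketch shows one way to do this; the `min_len` filter for discarding very short runs is a hypothetical addition and is not described in the paper.

```python
def frames_to_segments(preds, min_len=1):
    """Group consecutive frames predicted as highlight (1) into (start, end) index pairs."""
    segments, start = [], None
    for i, p in enumerate(preds):
        if p == 1 and start is None:
            start = i                              # a new run of highlight frames begins
        elif p != 1 and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))    # close the run at the previous frame
            start = None
    if start is not None and len(preds) - start >= min_len:
        segments.append((start, len(preds) - 1))   # run extends to the last frame
    return segments

preds = [-1, 1, 1, -1, 1, 1, 1]                    # 1 = highlight, -1 = non-highlight
print(frames_to_segments(preds))                   # [(1, 2), (4, 6)]
print(frames_to_segments(preds, min_len=3))        # [(4, 6)]
```

Because any value other than 1 ends a run, the same function works for both the 1/0 and the 1/−1 labeling conventions.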

Conclusion
In this study, we propose DENAN for LoL video highlight detection with low computation cost, in which a lightweight ShuffleNetV2 video encoder is used to extract frame-level features from LoL video sequences, and non-local attention is used to capture spatial-temporal long-range dependencies. Training with CE loss and TL improves the performance of DENAN. Experimental results on the NALCS and LMS datasets demonstrate the validity of the proposed method.