An empirical study on temporal modeling for online action detection

Online action detection (OAD) is a practical yet challenging task that has attracted increasing attention in recent years. A typical OAD system mainly consists of three modules: a frame-level feature extractor, usually based on pre-trained deep Convolutional Neural Networks (CNNs); a temporal modeling module; and an action classifier. Among them, the temporal modeling module is crucial, as it aggregates discriminative information from historical and current features. Although many temporal modeling methods have been developed for OAD and other topics, their effects on OAD have not been fairly investigated. This paper provides an empirical study on temporal modeling for OAD covering four meta types of temporal modeling methods, i.e. temporal pooling, temporal convolution, recurrent neural networks, and temporal attention, and uncovers good practices that produce a state-of-the-art OAD system. Many of these methods are explored in OAD for the first time and are extensively evaluated with various hyperparameters. Furthermore, based on our empirical study, we present several hybrid temporal modeling methods. Our best networks, i.e. the hybridization of DCC, LSTM, and M-NL, and the hybridization of DCC and M-NL, outperform previously published results by sizable margins on the THUMOS-14 dataset (48.6% vs. 47.2%) and the TVSeries dataset (84.3% vs. 83.7%).


Introduction
Online action detection (OAD) is an important problem in computer vision with a wide range of applications such as visual surveillance, human-computer interaction, and intelligent robot navigation. Different from traditional action recognition and offline action detection, which intend to recognize actions from fully observed videos, the goal of online action detection is to detect an action as it happens and ideally even before it is fully completed. It is a very challenging problem due to the extra restriction that only historical and current information may be used, in addition to the difficulties of traditional action recognition in untrimmed video streams.
In general, there exist two OAD tasks, i.e. spatial-temporal online action detection (ST-OAD) and temporal online action detection. Under the online setting, the former aims to localize actors and recognize actions in space-time, as introduced in [62], while the latter localizes and recognizes actions temporally only, as systematically introduced in [10]. Our study focuses on the temporal online action detection problem, and we omit 'temporal' for convenience in the rest of the paper.

Fig. 1 Online action detection aims to predict the ongoing action category from the historical and current frame information. A typical online action detection system is mainly composed of three parts: frame-level feature extraction, temporal modeling, and action classification

As illustrated in Fig. 1, an online action detection (OAD) system mainly consists of three important parts: a frame-level feature extractor (e.g. a deep Convolutional Neural Network, CNN), a temporal modeling module to aggregate frame-level features, and an action classifier. Recent works on online action detection mostly focus on the temporal modeling part, aiming to generate discriminative representations from the historical and current frame features. Inspired by sequence modeling methods in other areas, especially the Long Short-Term Memory recurrent network (LSTM) [28], various temporal modeling methods have been developed for online action detection recently. For example, Geest et al. [10] provided an LSTM-based baseline which shows superiority over a single-frame CNN model. Gao et al. [20] proposed an LSTM-based Reinforced Encoder-Decoder network for both action anticipation and online action detection. Geest et al. [11] proposed a two-stream feedback network, where one stream focuses on interpreting the input and the other models temporal dependencies between actions. Xu et al. [72] utilized LSTM cells to model temporal context, aiming to improve online action detection by adding prediction information to observed information.
Although the above LSTM-based temporal modeling methods have significantly boosted performance on existing OAD datasets (e.g. TVSeries [10], THUMOS-14 [32]), their superiority over other temporal models, e.g. naive temporal pooling, temporal convolution, and attention-based sequence models, is not discussed and remains unknown. Moreover, the fusion of different temporal models is also rarely investigated. To address these problems, we provide a fair empirical study on temporal modeling for online action detection in the following aspects.

Exploration of temporal modeling methods
We explore four popular types of temporal modeling methods with various hyperparameters to fairly illustrate their effects on online action detection, namely temporal pooling, temporal convolution, recurrent neural networks, and temporal attention models. Specifically, for temporal pooling, we evaluate average pooling (AvgPool) and max pooling (MaxPool) with various sequence lengths. For temporal convolution, we evaluate traditional temporal convolution (TC), pyramid dilated temporal convolution (PDC) [39], and dilated causal convolution (DCC) [49]. For recurrent neural networks, we evaluate LSTM and the Gated Recurrent Unit (GRU) with two output choices, i.e. the last hidden state and the average hidden state. For temporal attention, we evaluate naive self-attention (Naive-SA) with a linear fully connected (FC) layer and Softmax function, nonlinear self-attention (Nonlinear-SA) with an FC-tanh-FC-Softmax architecture, the Non-local block (standard self-attention with a skip connection), and our Modified Non-local (M-NL) with the current feature as the query (Q) and past information as the key (K) and value (V), which outperforms the traditional Non-local model. It is worth noting that (i) we try to keep the original names of these methods from other topics even though we adapt them for online action detection, and (ii) to the best of our knowledge, many of these methods, such as TC, PDC, DCC, and Non-local, are introduced into online action detection for the first time. Overall, we extensively explore eleven individual temporal modeling methods with off-the-shelf two-stream (TS) [58,66] frame features.
The hybridization of temporal modeling methods

Generally, sequence-to-sequence methods, e.g. PDC and LSTM, can be further processed by aggregation methods, such as temporal pooling and temporal attention, to create a single representation. Thus, we present several hybrid temporal modeling methods that combine different temporal modeling methods, aiming to uncover the complementarity among them. Interestingly, we find that a simple fusion between dilated causal convolution and our modified non-local or LSTM improves the individual models significantly.

Comparison with state-of-the-art
We extensively compare our individual and hybrid temporal models to existing baselines and recent state-of-the-art methods. Several hybrid temporal models outperform the best existing performance by a sizable margin on both TVSeries and THUMOS-14. Specifically, the fusion of dilated causal convolution and M-NL obtains 84.3% cAP on TVSeries, and the fusion of dilated causal convolution, LSTM, and M-NL achieves 48.6% mAP on THUMOS-14.
In summary, the main contributions of this work are as follows.

- We provide a fair empirical study on eleven temporal modeling methods for online action detection; many of these methods, such as TC, PDC, DCC, and Non-local, are introduced into OAD for the first time.
- We modify the traditional non-local block to use the current feature as the query and past features as the key and value, and term it M-NL. With this operation, the past information is weighted according to its dot-product similarity with the current feature, which outperforms the traditional non-local operation.
- In Section Hybrid temporal modeling methods, we explore several hybrid temporal modeling methods to uncover the complementarity among them. Two hybrid models (i.e. the hybridization of DCC, LSTM, and M-NL, and the hybridization of DCC and M-NL) outperform the state-of-the-art in extensive experiments on two benchmark datasets, i.e. TVSeries and THUMOS-14.

Related work
Our study is related to several other action-related tasks, namely action recognition, early action detection, action anticipation, temporal action detection, and spatial-temporal action detection. In this section, we first briefly overview these related tasks separately and then present recent works on online action detection.
Action recognition is an important branch of video-related research and has been extensively studied in the past decades. Existing methods are mainly developed for extracting discriminative action features from temporally complete action videos and can be roughly categorized into hand-crafted feature based approaches and deep learning based approaches. Early methods such as Improved Dense Trajectories (IDT) [65] mainly adopted hand-crafted features, such as Histograms of Oriented Optical Flow (HOF) [37], Histograms of Oriented Gradients (HOG) [37], and Motion Boundary Histograms (MBH) [?]. Recent studies demonstrate that action features can be learned by deep learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). For example, the two-stream network [58,66] learned appearance and motion features from RGB frames and optical flow fields separately. RNNs, such as the long short-term memory (LSTM) [28] and gated recurrent unit (GRU) [7], have been used to model long-term temporal correlations and motion information in videos and to generate video representations for action classification. Some recent works also tried to model temporal information within a 2D CNN instead of using the 2D CNN as a static feature extractor; e.g. both TSM [41] and TAM [16] proposed efficient approaches to aggregate features across frames inside the network. Another type of action recognition approach is based on 3D CNNs, which are widely used for learning on large-scale video datasets. C3D [13] was the first successful 3D CNN model for video classification. After that, many works extended C3D to different backbones, e.g. I3D [5] and ResNet3D [25]. In addition, some works aimed to reduce the complexity of 3D CNNs by decomposing the 3D convolution into a 2D spatial convolution and a 1D temporal convolution, e.g. P3D [51], S3D [69], and R(2+1)D [14].
Early action detection This topic is similar to online action detection but focuses on recognizing actions from partially observed videos, usually under the assumption that every video contains exactly one instance of a given action. Hoai et al. [27] proposed a max-margin framework with a structured SVM to address this problem. Huang et al. [29] introduced Sequential Max-Margin Event Detectors (SMMED) to efficiently detect an event in the presence of a large number of event classes. Ma et al. [46] addressed the problem by training an LSTM network with a ranking loss and merged the detection spans based on the frame-wise prediction scores generated by the LSTM.
Action anticipation aims to predict future unseen actions from historical and current information. Many works have been developed for this task in recent years. For instance, Ryoo et al. [54] developed an early action prediction system by observing evidence from temporally accumulated features. Yu et al. [75] formulated action prediction in a probabilistic framework which aimed to maximize the posterior of an activity given the observed frames. Aliakbarian et al. [1] developed a multi-stage LSTM architecture that leveraged context-aware and action-aware features and introduced a novel loss function that encouraged the model to predict the correct class as early as possible. Gao et al. [20] proposed a Reinforced Encoder-Decoder (RED) network for action anticipation, which used reinforcement learning to encourage the model to make correct anticipations as early as possible. Ke et al. [35] proposed an attended temporal feature, which used multi-scale temporal convolutions to process the time-conditioned observation. The widely used datasets for action anticipation, e.g. UCF-101 [63], JHMDB-21 [31], BIT-Interaction [36], and Sports-1M [34], include short trimmed videos, and the task mainly focuses on predicting the class of the ongoing action in time from only a small ratio of the observed parts. Our task is different from action anticipation: we mainly focus on long, unsegmented video data, e.g. TVSeries, usually with a large variety of irrelevant background.
Temporal action detection or localization is another hot topic, which aims to temporally localize and recognize actions by observing entire untrimmed videos. The main difference from OAD is the offline setting, i.e. post-processing is allowed for temporal action localization, and the whole action can be observed first. The problem has recently received increasing attention due to its potential applications in video data analysis. Shou et al. [57] localized actions in three stages: action proposal generation, proposal classification, and proposal regression. Xu et al. [71] transferred the Faster R-CNN [52] architecture to temporal action localization. Chao et al. [6] improved receptive field alignment using a multi-tower network and dilated temporal convolutions, and exploited the temporal context of actions for both proposal generation and action classification. Lin et al. [42] generated proposals by learning starting and ending probabilities with a temporal convolutional network and achieved promising performance over previous methods. Zeng et al. [76] applied Graph Convolutional Networks (GCNs) to model the relations among different proposals and learned powerful representations for action classification and localization.
Spatial-temporal action detection aims to determine the precise spatial-temporal extents of actions in videos, and has attracted increasing attention recently. Early methods mainly resorted to bag-of-words representations and spatio-temporal path search. In the deep learning era, many works transfer image-based object detection methods to this task, e.g. R-CNN [22], Faster R-CNN [52], and SSD [45]. These adapted methods mainly first detect actions at the frame level and then link the frame-level bounding boxes into final tubes [23,24,50,61]. Notably, the online setting is used in [61,62].
Online action detection is defined as an online per-frame labelling task given streaming videos, which requires correctly classifying every frame. Geest et al. [10] first introduced the problem along with a realistic dataset (i.e. TVSeries) and some baseline results. Their later work [11] introduced a two-stream feedback network, where one stream processes the input and the other models the temporal relations. Li et al. [40] designed a deep LSTM network for online action detection from 3D skeletons, which also estimates the start and end frames of the current action. Xu et al. [72] proposed the Temporal Recurrent Network (TRN) to model temporal context by simultaneously performing online action detection and anticipation. Eun et al. [15] designed a novel recurrent unit named the Information Discrimination Unit to explicitly discriminate the information relevant to an ongoing action from the rest. Besides, Shou et al. [56] formulated online detection of action start (ODAS) as a classification task over sliding windows and introduced a model based on Generative Adversarial Networks (GANs) to generate hard negative samples to improve training. Gao et al. [21] proposed StartNet to address ODAS, which decomposes the task into two stages: action classification and start point localization.

Temporal modeling approach
In this section, we introduce four meta types of temporal modeling methods for online action detection, including temporal pooling, temporal convolution, recurrent neural networks, and temporal attention.

Problem formulation
Given an observed video stream V = {I_0, I_1, ..., I_t} containing frames from time 0 to t, the goal of online action detection is to recognize the actions of interest occurring in frame I_t from these observed frames. This is very different from other tasks like action recognition and temporal action detection, which assume the entire video sequence is available at once. Formally, online action detection can be defined as the problem of maximizing the posterior probability:

y_t* = argmax_{y_t} P(y_t | I_0, I_1, ..., I_t),

where y_t ∈ R^{K+1} is the possible action label vector for frame I_t with K action classes and one background class. Thus, conditioned on the observed sequence V, the action label with the maximum probability P(y_t | I_0, I_1, ..., I_t) is chosen as the detection result for frame I_t. Generally, a pre-trained CNN model E is first used to extract frame-level features, e.g. the feature of the t-th frame f_t = E(I_t; θ) ∈ R^d, where θ denotes the fixed parameters of the model and d is the dimension of the feature embedding. Given the observed frame features {f_0, f_1, ..., f_t}, a temporal modeling module aims to aggregate discriminative information from them to better estimate the output action scores.
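As a concrete toy sketch of this decision rule, the snippet below picks the label maximizing P(y_t | I_0, ..., I_t); the logits are hypothetical stand-ins for the classifier output over K = 3 action classes plus background:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def detect_online(logits):
    """Return the argmax label and the posterior P(y_t | I_0, ..., I_t).

    `logits` stands in for the (K+1)-dim classifier output for frame I_t,
    computed from the aggregated historical and current features.
    """
    probs = softmax(logits)
    return int(np.argmax(probs)), probs

# Hypothetical logits: class 1 has the highest score.
label, probs = detect_online(np.array([0.1, 2.0, -1.0, 0.5]))
```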

Temporal modeling
For online action detection, considering that faraway frames may be unrelated to the current action state, we usually input frames of a limited sequence length L to the temporal modeling module, i.e. {f_{t−L+1}, f_{t−L+2}, ..., f_t}. For convenience, we denote the input features as F_in = {f_1, f_2, ..., f_L} ∈ R^{L×d} and the output of temporal modeling as S_out ∈ R^d, where L is the length of the input feature sequence and d is the feature dimension. Note that both the feature extraction (see Section 4.3) and the action classifier are the same for all the temporal models. In this section, we focus on the four types of temporal modeling methods illustrated in Fig. 2.

Temporal pooling
Temporal feature pooling has been extensively used for video classification [17,33,48,58] and is a simple method to generate a video-level representation from frame-level features. As shown in Fig. 2A, given the input feature sequence F_in = {f_1, f_2, ..., f_L}, we consider two temporal pooling approaches:

- Average pooling (AvgPool): S_out = (1/L) Σ_{t=1}^{L} f_t, i.e. the generated feature is the average representation of the past information.
- Max pooling (MaxPool): S_out = max_t f_t, which selects the most discriminative responses over the temporal dimension.

Fig. 2 The four meta types of temporal modeling methods, where L is the temporal length and d is the feature dimension: A temporal pooling with max or average operation; B temporal convolution; C recurrent neural network (RNN); D temporal attention. Specifically, for temporal convolution, we consider dilated temporal convolution with a parallel architecture (i.e. PDC) and a cascade architecture (i.e. DCC). For RNN, we consider LSTM and GRU cells with two output strategies. For temporal attention, we evaluate four types of attention methods, namely (a) naive self-attention (Naive-SA), which adopts an auxiliary FC-Softmax architecture to learn weights for weighted summarization, (b) Nonlinear-SA, which uses an FC-tanh-FC-Softmax architecture to learn weights, (c) the traditional non-local block, and (d) modified non-local with the current feature as the query and the past information as the key and value for online action detection. S_out denotes the output single representation (marked as an orange rectangle), '©' denotes concatenation, '⊕' denotes element-wise sum, and '⊗' denotes matrix multiplication
For temporal pooling models, the embedded feature vector S_out ∈ R^d is then fed into the action classifier to output a probability distribution of the current action over K action classes and one background class. Note that this operation is the same for all the other temporal models.
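The two pooling operators can be sketched in a few lines (NumPy here for brevity; shapes follow the F_in ∈ R^{L×d} notation, with illustrative values of L and d):

```python
import numpy as np

L, d = 4, 8
F_in = np.random.randn(L, d)   # L frame features of dimension d

# Average pooling: S_out = (1/L) * sum_t f_t
S_avg = F_in.mean(axis=0)

# Max pooling: element-wise maximum over the temporal dimension
S_max = F_in.max(axis=0)
```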

Traditional temporal convolution Formally, given the input feature sequence F_in = {f_1, f_2, ..., f_L}, a dilated temporal convolution outputs features as follows:

f'_t = Σ_{i=0}^{s−1} W_i · f_{t − r·i},

where r is the dilation rate indicating the temporal stride used to sample frames, W ∈ R^{d×s} is the convolutional kernel, and s is the kernel size. It reduces to traditional temporal convolution (i.e. Conv1D without dilation) when r = 1.
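A direct, unoptimized implementation of this formula, with the causal zero padding of length (s − 1)·r described below so the output has the same length as the input, might look as follows; the kernel layout (s, d_out, d) is our own convention for illustration:

```python
import numpy as np

def dilated_causal_conv(F_in, W, r=1):
    """f'_t = sum_{i=0}^{s-1} W[i] @ f_{t - r*i}.

    F_in: (L, d) feature sequence; W: (s, d_out, d) kernel; r: dilation rate.
    Zero padding of length (s-1)*r is added on the left (causal side)
    so the output length equals the input length.
    """
    L, d = F_in.shape
    s, d_out, _ = W.shape
    pad = np.zeros(((s - 1) * r, d))
    F_pad = np.concatenate([pad, F_in], axis=0)
    out = np.zeros((L, d_out))
    for t in range(L):
        tp = t + (s - 1) * r          # position of f_t in the padded sequence
        for i in range(s):
            out[t] += W[i] @ F_pad[tp - r * i]
    return out
```

With r = 1 this is plain temporal convolution; larger r enlarges the receptive field without extra parameters.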

Pyramid Dilated Temporal Convolution
The basic block for our PDC and DCC is shown in Fig. 3a, where each TemporalBlock has a dilated Conv1D, a ReLU activation [47], dropout for regularization, and optionally a 1 × 1 Conv1D for channel reduction. As shown in Fig. 3b, PDC first separately applies several TemporalBlocks with various dilation rates {r_1, r_2, ..., r_N}, where N is the number of dilation rates, and then concatenates the outputs. Formally, the concatenated feature at time t is defined as follows:

f'_t = [f_t^{(r_1)}; f_t^{(r_2)}; ...; f_t^{(r_N)}],

where f_t^{(r_N)} is the output of the N-th TemporalBlock at time t. Thus, we get the concatenated feature sequence F' = {f'_1, f'_2, ..., f'_L}, which is combined with F_in through the residual connection. In our study, we use three dilation rates {1, 2, 4} to efficiently enlarge the temporal receptive fields of the temporal convolution models (i.e. PDC and DCC), and to map the input sequence to an output sequence of the same length, zero padding of length (s − 1) * r is added in the TemporalBlock. After the temporal convolutional operation, we use an average pooling layer by default to generate a single representation S_out ∈ R^d. For online action detection, the past as well as the current information is accumulated by the pooling operation; thus, it is advantageous to make a decision for the current action based on the accumulated single representation.
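A PyTorch sketch of the TemporalBlock and the parallel PDC branch under the assumptions above (channel-wise concatenation of the per-rate outputs, 1 × 1 reduction, residual connection, average pooling); the hyperparameters are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Dilated Conv1D -> ReLU -> Dropout, with causal left padding
    of (s-1)*r so the output length matches the input length."""
    def __init__(self, d, s=2, r=1, p_drop=0.1):
        super().__init__()
        self.pad = (s - 1) * r
        self.conv = nn.Conv1d(d, d, kernel_size=s, dilation=r)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                        # x: (B, d, L)
        x = nn.functional.pad(x, (self.pad, 0))  # left (causal) padding
        return self.drop(self.relu(self.conv(x)))

class PDC(nn.Module):
    """Parallel TemporalBlocks with dilation rates {1, 2, 4}; their outputs
    are concatenated per time step, reduced by a 1x1 conv, combined with
    the input via a residual connection, then average-pooled."""
    def __init__(self, d, rates=(1, 2, 4)):
        super().__init__()
        self.blocks = nn.ModuleList(TemporalBlock(d, r=r) for r in rates)
        self.reduce = nn.Conv1d(d * len(rates), d, kernel_size=1)

    def forward(self, x):                        # x: (B, d, L)
        cat = torch.cat([blk(x) for blk in self.blocks], dim=1)
        out = self.reduce(cat) + x               # residual connection
        return out.mean(dim=-1)                  # S_out: (B, d)
```

DCC instead stacks such blocks in cascade, multiplying the receptive field with depth.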

Recurrent neural network
Recurrent Neural Network and its variants have recently been transformed from other sequence modeling topics into action classification [12,18,44,48] and detection [21,60,72].Specifically, we evaluate two popular recurrent cells, namely Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU).
Long short-term memory Given the input feature sequence F_in = {f_t}_{t=1}^{L}, an RNN cell can effectively accumulate information to learn a compact representation H = {h_t}_{t=1}^{L} with the same length as the input. Specifically, at each time t, the LSTM uses the previous hidden state h_{t−1}, the cell state c_{t−1}, and the feature f_t to update its hidden state h_t and cell state c_t. Formally, the LSTM is formulated as follows:

i_t = σ(W_i f_t + U_i h_{t−1} + b_i),
g_t = σ(W_g f_t + U_g h_{t−1} + b_g),
c_t = g_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c f_t + U_c h_{t−1} + b_c),
o_t = σ(W_o f_t + U_o h_{t−1} + b_o),
h_t = o_t ⊙ tanh(c_t),

where σ is the logistic sigmoid function, ⊙ is element-wise multiplication, and i, g, c, and o are, respectively, the input gate, forget gate, memory cell, and output gate. h is the hidden state activation vector, W and U are weight matrices, and b_i, b_g, b_c, b_o denote the bias vectors.
Gated recurrent unit Similar to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit. The main difference between the LSTM and the GRU is that there is no separate memory cell in the GRU. Formally, the GRU can be formulated as follows:

r_t = σ(W_r f_t + U_r h_{t−1} + b_r),
z_t = σ(W_z f_t + U_z h_{t−1} + b_z),
h̃_t = tanh(W_h f_t + U_h (r_t ⊙ h_{t−1}) + b_h),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where r_t is the reset gate, z_t is the update gate, and ⊙ is element-wise multiplication. h̃_t is the candidate hidden state activation at time step t, tanh is a nonlinear activation, and σ denotes the sigmoid function.

Output Strategy
We consider two methods for the above recurrent models to generate the final single representation S_out: (i) following the traditional Encoder-Decoder method, we directly take the hidden state at the last time step, i.e. S_out = h_L; (ii) we average the outputs of all time steps, i.e. S_out = (1/L) Σ_{t=1}^{L} h_t.
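Both output strategies can be illustrated with PyTorch's built-in LSTM (the dimensions here are hypothetical, not the paper's setting):

```python
import torch
import torch.nn as nn

d, L = 16, 4
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)
F_in = torch.randn(1, L, d)          # (batch, L, d)

H, (h_last, _) = lstm(F_in)          # H: all hidden states, (1, L, d)

# (i) last hidden state: S_out = h_L
S_last = h_last.squeeze(0).squeeze(0)

# (ii) average over all time steps: S_out = (1/L) * sum_t h_t
S_avg = H.mean(dim=1).squeeze(0)
```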

Temporal attention
The attention mechanism [2,43,64,73] allows the model to selectively focus on a subset of frames by increasing the attention weights of the corresponding temporal features, while ignoring irrelevant signals and noise. We evaluate four attention methods, namely (1) naive self-attention (Naive-SA), (2) nonlinear self-attention (Nonlinear-SA) with an FC-tanh-FC-Softmax architecture, (3) the Non-local block (standard self-attention with a skip connection), and (4) our Modified Non-local with the current information as the query (Q) and past information as the memory (key (K) and value (V)).
Naive self-attention Given the feature sequence F_in = {f_1, f_2, ..., f_L}, Naive-SA can be implemented with a linear fully-connected (FC) layer and a Softmax function as follows:

a = Softmax(W F_in^T + b),

where W ∈ R^d and b are the weight and bias parameters of the FC layer, a ∈ R^L is the attention weight vector, and F_in^T denotes transposition of the feature matrix.
Nonlinear self-attention Similar to [73], we can also add more nonlinearity (i.e. Nonlinear-SA):

a = Softmax(U_2 tanh(U_1 F_in^T + b_1) + b_2),

where U_1 ∈ R^{d_1×d} and b_1 are the parameters of the first FC layer, and U_2 ∈ R^{d_1} and b_2 are the parameters of the second FC layer. With either of the above attention weight vectors, the output representation is the weighted average S_out = Σ_{t=1}^{L} a_t f_t.
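Both attention variants reduce to a few matrix operations; below is a NumPy sketch under the shapes defined above, with randomly initialized parameters for illustration only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def naive_sa(F_in, W, b):
    """a = Softmax(W F_in^T + b); S_out = sum_t a_t f_t."""
    a = softmax(F_in @ W + b)            # attention weights, (L,)
    return a, a @ F_in                   # S_out: (d,)

def nonlinear_sa(F_in, U1, b1, U2, b2):
    """a = Softmax(U2 tanh(U1 F_in^T + b1) + b2)  (FC-tanh-FC-Softmax)."""
    a = softmax(np.tanh(F_in @ U1.T + b1) @ U2 + b2)
    return a, a @ F_in

L, d, d1 = 4, 8, 5
F_in = np.random.randn(L, d)
a1, s1 = naive_sa(F_in, np.random.randn(d), 0.1)
a2, s2 = nonlinear_sa(F_in, np.random.randn(d1, d),
                      np.random.randn(d1), np.random.randn(d1), 0.1)
```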
Non-local is another popular attention-based model, which was originally proposed to capture long-range dependencies [67]. The core idea of Non-local is to model the correlations between contextual signals with an attention mechanism. Specifically, it encodes the input sequence into a higher-level representation by modeling the relationship between queries (Q), keys (K), and values (V) as

Attention(Q, K, V) = Softmax(Q K^T / √d_k) V,

where d_q, d_k, and d_v are the feature dimensions of the encoded representations (usually d_q = d_k = d_v). This architecture becomes standard "self-attention" [64] when Q = K = V = {f_1, f_2, ..., f_L}, as illustrated in Fig. 2D(c). Normally, we use two convolution layers, each followed by Batch Normalization [30] and ReLU [47], to generate two new features Q and K from F_in, and the Non-local method [67] further adds a skip connection between the input and the output:

F' = F_in + Softmax(Q K^T / √d_k) V.

The updated temporal feature F' is processed with average pooling by default to generate the final temporal representation S_out = Avg(F') ∈ R^d.

Modified non-local The query Q in the above attention can also be a single feature vector, similar to [68], which replaces the self-attention weights by those between a local feature and long-term features. We compute the dot-product attention between the current feature f_L and the historical features F = {f_1, f_2, ..., f_{L−1}}, as illustrated in Fig. 4, and denote this operation as Modified Non-local (M-NL). Different from the traditional non-local operation, this adaptation is based on the assumption that the current feature is the most important one for online action detection, so the past information is weighted according to its dot-product similarity with the current feature. With this operation, an attention weight vector a ∈ R^{L−1} is obtained and used to aggregate the historical features into the final representation:

S_out = Σ_{t=1}^{L−1} a_t f_t.

Training and inference With the output representation S_out of the temporal modeling module, we use a linear fully-connected (FC) layer with a Softmax function for action classification and train the whole network with the Cross-Entropy loss. Specifically, we divide the feature sequence of a video into non-overlapping windows of size L as the input of our temporal modeling module. At the test stage, a sliding window of size L and stride 1 is used to form the input, and the prediction is made for the last frame.
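The M-NL weighting can be sketched as follows (NumPy; the learned Q/K/V projections and normalization layers of the full model are omitted, so this only illustrates the query/memory split):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def modified_nonlocal(F_in):
    """Current feature f_L as the query; f_1..f_{L-1} as key/value memory."""
    q = F_in[-1]                # current feature, (d,)
    mem = F_in[:-1]             # historical features, (L-1, d)
    a = softmax(mem @ q)        # dot-product attention weights, (L-1,)
    return a, a @ mem           # S_out aggregated from the past, (d,)

L, d = 4, 8
a, S_out = modified_nonlocal(np.random.randn(L, d))
```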

Experimental configuration
In this section, we first introduce two widely used OAD datasets, i.e. TVSeries and THUMOS-14, and then describe our implementation details, including unit-level feature extraction and hyperparameter settings.

THUMOS-14 [32] is a popular benchmark for temporal action detection. It contains over 20 hours of sports videos annotated with 20 actions (e.g. diving, high jump, throwing discus, etc.). The training set (i.e. UCF101 [63]) contains only trimmed videos that cannot be used to train temporal action detection models. Following prior works [20,72], we train our model on the validation set (including 3K action instances in 200 untrimmed videos) and evaluate on the test set (including 3.3K action instances in 213 untrimmed videos).
To investigate the characteristics of these datasets, we depict the temporal length distributions of action instances on TVSeries and THUMOS-14 in Fig. 5. We observe that 70% of action instances on TVSeries are very short (i.e. 0-2 s), while half of the instances on THUMOS-14 are longer than 3 seconds.

Evaluation protocols
For each class on TVSeries, we use the per-frame calibrated average precision (cAP) proposed in [10]. Frames are ranked by their prediction scores, and the calibrated precision at cut-off frame k is cPrec(k) = TP / (TP + FP/w), where w is the ratio between negative and positive frames. The cAP is then computed as cAP = (Σ_k cPrec(k) · I(k)) / P, where I(k) is an indicator function equal to 1 if the cut-off frame k is a true positive, and P denotes the total number of true positives. The mean cAP over all classes is reported as the final performance. The advantage of cAP is that it is fair under class imbalance. For THUMOS-14, we report per-frame mean Average Precision (mAP).
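A minimal sketch of the single-class cAP computation as described above (our own implementation for illustration, assuming binary per-frame labels ranked by score):

```python
import numpy as np

def calibrated_ap(scores, labels):
    """Per-frame calibrated AP for one class, following [10].

    Frames are ranked by score; at each true-positive cut-off,
    cPrec = TP / (TP + FP / w) with w = #negatives / #positives,
    and cAP averages cPrec over the P true positives.
    """
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    w = (labels == 0).sum() / max((labels == 1).sum(), 1)
    tp = fp = 0
    precs = []
    for y in labels:
        if y == 1:
            tp += 1
            precs.append(tp / (tp + fp / w))
        else:
            fp += 1
    return float(np.mean(precs)) if precs else 0.0
```

With w = 1 (balanced data), cAP reduces to the usual average precision.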

Implementation details
Unit-level feature extraction Following previous works [19,20,42,72], a long untrimmed video is first cut into non-overlapping video units, each containing n_u consecutive frames. A video unit u is processed by a visual encoder E_v to extract the unit-level representation. In our experiments, we extract frames from all videos at 24 frames per second, and the video unit size n_u is set to 6, i.e. 0.25 s. We use the two-stream network [70] as the visual encoder E_v, pre-trained on ActivityNet-1.3 [4]. In each unit, the central frame is sampled to compute the appearance CNN feature, taken from the Flatten_673 layer of ResNet-200 [26]. For the motion feature, we sample 6 consecutive frames at the center of a unit and compute optical flow between them. These flows are then fed into the pre-trained BN-Inception model [30], and the output of the global pool layer is extracted. The motion and appearance features are both 2048-D and are concatenated into 4096-D vectors (i.e. d = 4096), which are used as the unit-level features.
Hyperparameter setting For the PDC model, the concatenated features are fed into an additional 1 × 1 convolution to reduce the feature dimension to 4096. For the DCC model, we use 3 dilated convolution layers, each comprising one dilated convolution with kernel size s = 2 and stride 1, followed by ReLU and dropout. The output dimension of the second layer d_{r1} is set to 2048. For the Non-local attention model, we set d_m to 512 and d_v to 4096. According to the number of action classes, we set K + 1 to 31 for TVSeries and 21 for THUMOS-14. Our experiments are conducted in PyTorch, and we use the Stochastic Gradient Descent (SGD) [53] optimizer to train the network from scratch. The learning rate, momentum, and decay rate are set to 10^−3, 0.9, and 0.95, respectively. All of our experiments are run on 8 GTX TITAN X GPUs, an Intel i7 CPU, and 128 GB of memory.

Behavioural study of temporal modeling methods
In this section, we first present a quick comparison among the best settings of the four mentioned temporal modeling methods, and then extensively explore both individual temporal modeling methods and their combinations, and finally compare our results to the state-of-the-art.

A quick comparison of temporal modeling methods
As mentioned in the Introduction, we explore eleven temporal modeling methods in total from four meta types, including two temporal pooling methods, three temporal convolution methods, two recurrent neural networks, and four temporal attention methods. For a quick glance, we select the best setting for each model and compare their performance; the results are shown in Table 1. For a fair comparison, the input sequence length L is fixed to 4. Several observations can be made: (1) For temporal pooling models, AvgPool consistently works better than MaxPool; for temporal convolution models, DCC achieves better results on both TVSeries and THUMOS-14 than TC and PDC, which indicates that discriminative information can be obtained effectively by stacking dilated causal convolutions.
(2) M-NL performs slightly better than AvgPool, which demonstrates the effectiveness of the attention mechanism. (3) LSTM outperforms M-NL and AvgPool by sizable margins on both datasets, which shows that the temporal dependencies captured by LSTM are crucial for accurate online action detection. (4) Overall, an interesting finding is that the temporally dependent methods, i.e. temporal convolution and RNNs, are superior to the temporally independent methods for online action detection.
Model efficiency analysis Table 1 also compares the number of parameters and FLOPs of each meta type. The FLOPs and parameter counts indicate the amount of computation needed to process the inputs, which is a vital factor, especially for online action detection. For a fair comparison, the input sequence length L is fixed to 4 by default, and we test all the models in the same environment with a GTX TITAN X GPU. The numbers of parameters of AvgPool, DCC, LSTM, and M-NL are 0, 151M, 134M, and 21M, respectively. Trading off accuracy against computation, the LSTM model is a good compromise, offering similar accuracy with fewer parameters and a lower computational cost in FLOPs.

Temporal pooling
We test two temporal pooling methods (i.e., average pooling and max pooling) with different sequence lengths. The results are shown in Fig. 6. We also compare them to the baseline that uses a fully-connected (FC) layer and Softmax to generate predictions, which achieves … (cAP) and 36.3% (mAP) on TVSeries and THUMOS-14, respectively. For temporal pooling, it is clear that average pooling consistently performs better than max pooling on both datasets. Increasing the sequence length improves both pooling methods at first but degrades them dramatically beyond the saturation length. This can be explained by the fact that an appropriate amount of historical information introduces useful context for online action detection, while long-term historical information may introduce unrelated information and may also smooth the final representation. Another observation is that increasing the sequence length beyond L = 4 is seriously harmful for TVSeries but not for THUMOS-14. This reflects the fact that each video in TVSeries contains multiple actions and numerous varied background frames, whereas each video in THUMOS-14 only contains one action instance. Overall, the simple AvgPool method (L = 4) improves the baselines on TVSeries and THUMOS-14 by 1.4% and 5.2%, respectively.
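The two pooling operators can be sketched as follows; both collapse an input feature sequence F_in of shape (L, d) into a single d-dimensional representation along the temporal axis (random features stand in for the CNN output).

```python
import numpy as np

# Minimal sketch of AvgPool vs. MaxPool over a feature sequence of shape (L, d).
rng = np.random.default_rng(0)
L, d = 4, 4096                    # best sequence length and feature dimension from the text
F_in = rng.standard_normal((L, d))

avg_pool = F_in.mean(axis=0)      # AvgPool: mean over the temporal axis
max_pool = F_in.max(axis=0)       # MaxPool: element-wise maximum over time
```

Either pooled vector is then fed to the FC classifier; the difference between the two is only the reduction operator.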

Temporal convolution
As shown in Table 2, we compare temporal convolution models with different kernel sizes s and dilation rates r, denoted as (s, r). For PDC and DCC, we use temporal convolutional filters with kernel size s = 2 as a building block. The input sequence length is fixed to L = 4 for all comparison experiments. To obtain an output of equal length to the input, we add zero padding as needed. Several observations can be made as follows.
(1) The comparison between TC(2,1) and TC(3,1) indicates that kernel size s = 2 is slightly better than s = 3 on both datasets.
(2) The comparison among TC(2,1), TC(2,2), and TC(2,4) shows that different dilation rates perform similarly on both datasets. (3) Both PDC and DCC, which combine TC(2,1), TC(2,2), and TC(2,4) in a parallel or cascade manner, significantly improve over the plain TC models. This demonstrates that combining multi-dilation temporal convolution layers can capture complementary multi-scale action information. (4) DCC, with its cascade manner, obtains the best results, reaching 83.1% on TVSeries and 46.8% on THUMOS-14. The success of dilated causal convolution in a cascade manner suggests that discriminative and relevant information can be enhanced layer by layer, from the smaller temporal receptive field (r = 1) to the larger one (r = 4).
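The cascade of dilated causal convolutions can be sketched as follows. This is a simplified stand-in for DCC (no dropout, no 1 × 1 channel reduction, toy dimensions): each layer's output at time t depends only on the present and past, left-padding keeps the sequence length, and the dilation rate doubles with depth.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated convolution with kernel size s = 2 over an (L, d) sequence.

    The output at time t depends only on x[t] and x[t - dilation]; the sequence
    is left-padded with zeros so the output length equals the input length.
    """
    L, d = x.shape
    xp = np.concatenate([np.zeros((dilation, d)), x], axis=0)
    # w has shape (2, d, d_out): one projection matrix per kernel tap
    return np.stack([xp[t] @ w[0] + xp[t + dilation] @ w[1] for t in range(L)])

rng = np.random.default_rng(0)
L, d = 4, 16                      # toy dimensions for illustration
x = rng.standard_normal((L, d))

h = x
for r in (1, 2, 4):               # dilation rate grows exponentially with depth
    w = rng.standard_normal((2, d, d)) * 0.1
    h = np.maximum(causal_dilated_conv(h, w, r), 0.0)   # ReLU
f_dcc = x + h                     # residual connection with the input
```

After the three layers, the receptive field of the last time step covers the whole input window, which is the multi-scale effect the text attributes to the cascade design.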

Recurrent neural network
We evaluate LSTM and GRU in the following four aspects: input sequence length, output strategy, hidden size, and the number of recurrent layers.
Input sequence length and output strategy For these two factors, we vary the sequence length from 2 to 16 and evaluate two alternative output strategies: the last hidden state S_out = h_L and the average hidden state S_out = (1/L) Σ_{t=1}^{L} h_t. The hidden size is fixed to 4096, and only one recurrent layer is used for this evaluation. Figure 9 illustrates the comparison results for LSTM and GRU. Several conclusions can be drawn as follows.
(1) The 'last hidden state' strategy performs consistently better than the 'average hidden state' strategy. This can be explained by the fact that both LSTM and GRU automatically accumulate discriminative information into the last state through their temporal dependency operations, whereas averaging all the hidden states may introduce unrelated or noisy information for online action detection. (2) LSTM performs better than GRU on THUMOS-14 but similarly or worse on TVSeries. This indicates that the separate memory cell in LSTM helps capture more context information, which is crucial for THUMOS-14, while too much context (unrelated actions or background) can degrade performance on TVSeries. (3) The effect of sequence length for both LSTM and GRU is the same as for the pooling methods, and the best trade-off sequence length is 4 on both datasets.

The number of recurrent layers Generally, one can easily stack several recurrent layers to model complex sequence dependencies. To this end, we evaluate the number of recurrent layers for both LSTM and GRU on TVSeries and THUMOS-14. The results are shown in Fig. 8. Interestingly, adding one more layer does not bring a performance gain and even dramatically degrades the performance of LSTM on both datasets. The main problem is that adding one more recurrent layer doubles the number of parameters, which easily leads to overfitting.
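The two output strategies can be sketched as follows, using a plain tanh RNN cell with random weights as a stand-in for the LSTM/GRU cells (illustration only, not the paper's model):

```python
import numpy as np

# Run a toy recurrent cell over an (L, d) sequence, then form both outputs.
rng = np.random.default_rng(0)
L, d, d_h = 4, 8, 8
x = rng.standard_normal((L, d))
W_x = rng.standard_normal((d, d_h)) * 0.1
W_h = rng.standard_normal((d_h, d_h)) * 0.1

h = np.zeros(d_h)
hidden_states = []
for t in range(L):
    h = np.tanh(x[t] @ W_x + h @ W_h)   # recurrent update
    hidden_states.append(h)
H = np.stack(hidden_states)             # all hidden states, shape (L, d_h)

s_last = H[-1]            # 'last hidden state' strategy: S_out = h_L
s_avg = H.mean(axis=0)    # 'average hidden state' strategy: S_out = (1/L) * sum_t h_t
```

Only the final representation differs between the two strategies; the recurrence itself is identical.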

Temporal attention
We compare the four attention models mentioned in Section Temporal modeling, i.e., Naive-SA as described in Eq. (7), Nonlinear-SA as described in Eq. (8), Non-local as described in Eq. (10), and M-NL as described in Eq. (11).
As shown in Table 1, for the temporal attention models, several observations can be made as follows.
(1) Nonlinear-SA outperforms Naive-SA by 0.8% on TVSeries and 2.6% on THUMOS-14. Compared to Naive-SA, Nonlinear-SA computes attention weights with an additional nonlinear tanh and a linear FC layer, which may be more effective for modeling complex temporal relationships. (2) Non-local performs on par with Nonlinear-SA on both datasets, indicating that they share a similar attention mechanism. (3) Our Modified Non-local, with the current feature as the query and past features as keys and values, performs better than Non-local by 0.6% on TVSeries and 0.9% on THUMOS-14, showing the effectiveness of our proposed design (i.e., computing attention between the current feature and historical features) for online action detection.
As the hyper-parameter d_1 in Nonlinear-SA (see Eq. (8)) can impact the final performance, we also evaluate it in Table 3. We observe that d_1 = 512 (1024) yields the best performance for TVSeries (THUMOS-14), and the final performance is not very sensitive to it.
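The M-NL idea described above can be sketched as follows: the current feature acts as the query and the historical features act as keys and values. The projection matrices and toy dimensions here are stand-ins (the paper uses d_m = 512 and d_v = 4096); the exact formulation is Eq. (11).

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
L, d, d_m = 4, 32, 16
F_in = rng.standard_normal((L, d))
W_q = rng.standard_normal((d, d_m)) * 0.1
W_k = rng.standard_normal((d, d_m)) * 0.1
W_v = rng.standard_normal((d, d)) * 0.1

q = F_in[-1] @ W_q                       # query: the current feature
K = F_in[:-1] @ W_k                      # keys: the historical features
V = F_in[:-1] @ W_v                      # values: the historical features
attn = softmax(K @ q / np.sqrt(d_m))     # attention of the current frame over history
out = attn @ V                           # aggregated historical context, shape (d,)
```

In contrast, the standard Non-local block would let every time step attend over all others; restricting the query to the current frame is the modification the text highlights.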
Fig. 10 The proposed hybrid framework for online action detection, which consists of a DCC module, an LSTM layer, and an M-NL module

Hybrid temporal modeling methods
Generally, sequence-to-sequence temporal models, e.g., DCC and LSTM, can be further processed by aggregation methods like temporal pooling and temporal attention to generate a single representation. Based on our empirical study of temporal models, we choose the best setting of each temporal modeling method, i.e., LSTM, DCC, AvgPool, and M-NL, and propose several novel hybrid temporal modeling methods, as presented in Table 4, aiming to uncover the complementarity among them. We mainly combine temporal-dependent models with temporal-independent models. To clarify the architecture of the hybrid models, we illustrate M3 (i.e., DCC ⊕ LSTM ⊕ M-NL) in Fig. 10. The results of the hybrid methods on TVSeries and THUMOS-14 are shown in Table 4. Several observations can be made as follows.
(1) The best results on TVSeries and THUMOS-14 are achieved by M2 and M3, respectively. Specifically, with the combination of the DCC and M-NL models, we achieve a mean cAP of 84.3% on TVSeries. By further adding LSTM, we obtain an mAP of 48.6% on THUMOS-14. (2) Adding LSTM degrades the performance by 1.3% on TVSeries while improving it by 1.4% on THUMOS-14. This may be explained by the fact that temporal dependencies are important for the long-term action instances of THUMOS-14 but harmful for the predominantly short-term action instances of TVSeries. Specifically, DCC uses dilated convolutions to systematically capture multi-scale temporal contextual information through the exponential expansion of the receptive field. By adding LSTM, long-term temporal dependencies can be further accumulated in the last hidden state, but this operation also accumulates irrelevant information, especially on the TVSeries dataset, since different actions can occur at the same time and be performed by the same or multiple actors, as opposed to THUMOS-14, where actions are separated by non-action segments. (3) Comparing M3→M4 and M5→M6, the order of LSTM and DCC makes a difference on both datasets; the DCC-first order performs better than the LSTM-first order, especially on TVSeries, showing the effectiveness of DCC in discriminating relevant information and filtering out irrelevant information for online action detection.
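The M3 pipeline (DCC → LSTM → M-NL) can be sketched at the shape level as follows. Each stage is reduced to a toy stand-in (a causal mixing for DCC, a tanh recurrence for LSTM, a dot-product attention for M-NL) purely to show how the representations flow; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 16
F_in = rng.standard_normal((L, d))

# Stage 1 (DCC stand-in): a causal transform that preserves the sequence length;
# each output depends only on current and past inputs.
dcc_out = np.stack([F_in[: t + 1].mean(axis=0) for t in range(L)])

# Stage 2 (LSTM stand-in): recurrent accumulation over the DCC outputs.
W = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
states = []
for t in range(L):
    h = np.tanh(dcc_out[t] + h @ W)
    states.append(h)
H = np.stack(states)

# Stage 3 (M-NL stand-in): the last hidden state attends over the earlier ones.
scores = H[:-1] @ H[-1]
attn = np.exp(scores - scores.max())
attn /= attn.sum()
representation = attn @ H[:-1]          # single vector fed to the classifier
```

Swapping stages 1 and 2 gives the M4 ordering, and dropping stage 2 or 3 gives M2 and M5, which is how the ablations in Table 4 relate to one another.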

Comparison with state-of-the-art
In this section, we compare our best results to state-of-the-art approaches on the TVSeries and THUMOS-14 datasets. As shown in Table 5, with two-stream features, we achieve 84.3% in terms of mean cAP on TVSeries, which outperforms the recent, sophisticatedly designed TRN [72] by 0.6%. In addition, our hybrid method yields better performance than TRN with shorter temporal information (i.e., 4 chunks vs. … chunks). Besides, we also present a per-action-class comparison of our method with a previous method [10] on TVSeries in Fig. 11. Our method consistently outperforms CNN and LSTM by a large margin, except for the action classes Use computer and Write, where only subtle changes happen in the scene.
In Table 6, we compare our proposed hybrid method with state-of-the-art approaches for online and offline action detection on the THUMOS-14 dataset. The compared offline action detection methods perform frame-level prediction. We achieve an mAP of 48.6%, which outperforms TRN [72] by 1.4%.

Conclusions
In this paper, we provide an empirical study on temporal modeling for online action detection covering four meta types of temporal modeling methods, i.e., temporal pooling, temporal convolution, recurrent neural networks, and temporal attention. We extensively explore eleven individual temporal modeling methods as well as several hybrid temporal models that combine temporal-dependent models with temporal-independent models to uncover the complementarity among them. Based on our empirical study, we find that a simple hybridization of dilated causal convolution with M-NL or LSTM significantly improves over the individual models and outperforms the best existing performance by a sizable margin on both the TVSeries and THUMOS-14 datasets.

Declarations
Conflict of interest On behalf of all authors, the corresponding authors state that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 2
Fig. 2 Given an input feature sequence (marked as pink rectangles or circles) F_in = {f_1, f_2, …, f_L} ∈ R^{L×d}, where L is the temporal length and d is the feature dimension, we investigate four meta types of temporal modeling methods: A temporal pooling with max or average operation; B temporal convolution; C recurrent neural network (RNN); D temporal attention. Specifically, for temporal convolution, we consider dilated temporal convolution with a parallel architecture (i.e., PDC) and a cascade architecture (i.e., DCC). For RNN, we consider LSTM and GRU cells with two output strategies. For temporal attention, we consider Naive-SA, Nonlinear-SA, Non-local, and our Modified Non-local (M-NL)

Fig. 5
Fig. 5 The temporal length distributions of action instances

Fig. 6
Fig. 6 Comparison between average pooling and max pooling with different sequence lengths L as input

Fig. 7
Fig. 7 Evaluation of hidden size for LSTM with the last hidden state output strategy and sequence length L = 4

Fig. 8
Fig. 8 Evaluation of the number of recurrent layers for LSTM and GRU

Fig. 9
Fig. 9 Evaluation of input sequence length and output strategy for LSTM and GRU

As shown in Fig. 10, the output sequence of DCC is further processed by LSTM to capture strong temporal dependencies, and finally the M-NL module is used to generate the single representation for classification. The other hybrid models are described as follows. M1 LSTM ⊕ M-NL: the input feature sequence is first fed into LSTM to update the hidden states at all time steps; these are then fed into our Modified Non-local with the last hidden state as Q and the other hidden states as K and V to generate a single representation, on which classification is performed; the diagram is similar to Fig. 10 but without the DCC module. M2 DCC ⊕ M-NL: the output of the DCC network before the average pooling layer has the same length as the input sequence, and our Modified Non-local is applied to the output sequence of DCC to generate a single representation for current action classification; the diagram is similar to Fig. 10 but without the LSTM module. M4 LSTM ⊕ DCC ⊕ M-NL: the hidden states at all time steps of the LSTM layer are first fed into DCC and then into M-NL; this model is similar to M3 except that it swaps the order of DCC and LSTM. M5 DCC ⊕ LSTM: the output sequence of DCC is processed by LSTM to update the hidden states with the same length as the inputs, and then the last hidden state is used for action classification; the diagram is similar to Fig. 10 but without the M-NL module. M6 LSTM ⊕ DCC ⊕ AvgPool: this model replaces the M-NL of M3 with AvgPool over the output of DCC to generate the final representation for classification.

Fig. 11
Fig. 11 The online detection results of our method compared to previous methods in terms of per-frame cAP (%) for each action class on TVSeries

Fig. 3
Fig. 3 Illustration of A TemporalBlock, B PDC and C DCC. In PDC and DCC, TemporalBlocks with different dilation rates r = {1, 2, 4} are used to efficiently enlarge the temporal receptive fields, and Conv1D with kernel size s = 1 is used for channel reduction. L denotes the temporal length and d denotes the channel dimension of the input and output feature sequences. '©' denotes the concatenation operation and '⊕' denotes element-wise sum. To keep the same feature dimension as the inputs, a Conv1D with kernel size s = 1 is added optionally for channel reduction

Dilated Causal Convolution As shown in Fig. 3(c), our dilated causal convolution (DCC) stacks several TemporalBlocks with different dilation rates r. For each layer, we increase the dilation rate r exponentially with the depth i of the network (i.e., r = O(2^i)) and change the feature dimension to d_i at level i of the network. The resulting feature is fused with the input F_in by a residual connection, leading to the updated features f_DCC. The computation of DCC can be denoted as

Table 1
A quick comparison among the best settings of different meta types of temporal modeling methods

Table 5
Comparison with state-of-the-art methods in terms of per-frame cAP (%) on TVSeries

Table 6
Comparison with published state-of-the-art methods in terms of per-frame mAP (%) on THUMOS-14