1 Introduction

Video understanding and, specifically, action classification have benefited a lot from deep learning and the larger datasets that have been created in recent years. The new datasets, such as Kinetics [20], ActivityNet [13], and SomethingSomething [10] have contributed more diversity and realism to the field. Deep learning provides powerful classifiers at interactive frame rates, enabling applications like real-time action detection [30].

While action detection, which quickly decides on the present action within a short time window, is fast enough to run in real-time, activity understanding, which is concerned with longer-term activities that can span several seconds, requires the integration of the long-term context to achieve full accuracy. Several 3D CNN architectures have been proposed to capture temporal relations between frames, but they are computationally expensive and, thus, can cover only comparatively small windows rather than the entire video. Existing methods typically use some post-hoc integration of window-based scores, which is suboptimal for exploiting the temporal relationships between the windows.

In this paper, we introduce a straightforward, end-to-end trainable architecture that exploits two important principles to avoid the above-mentioned dilemma. Firstly, a good initial classification of an action can already be obtained from just a single frame. The temporal neighborhood of this frame comprises mostly redundant information and is almost useless for improving the belief about the present action. Therefore, we process only a single frame of a temporal neighborhood efficiently with a 2D convolutional architecture in order to capture the appearance features of that frame. Secondly, to capture the contextual relationships between distant frames, a simple aggregation of scores is insufficient. Therefore, we feed the feature representations of distant frames into a 3D network that learns the temporal context between these frames and so can improve significantly over the belief obtained from a single frame, especially for complex long-term activities. This principle is closely related to the so-called early or late fusion used for combining the RGB stream and the optical flow stream in two-stream architectures [8]. However, this principle has been mostly ignored so far for aggregation over time and is not part of the state-of-the-art approaches.

Consistent implementation of these two principles together leads to high classification accuracy without bells and whistles. The long temporal context of complex actions can be fully captured, whereas the fact that the method only looks at a very small fraction of all frames in the video leads to extremely fast processing of entire videos. This is especially beneficial in video retrieval applications.

Additionally, this approach opens the possibility for online video understanding. In this paper, we also present a way to use our architecture in an online setting, where we provide a fast first guess on the action and refine it using the longer-term context as a more complex activity becomes established. In contrast to online action detection, which has been enabled recently [30], the approach not only provides fast reaction times, but also takes the longer-term context into account.

We conducted experiments on various video understanding problems including action recognition and video captioning. Although we just use RGB images as input, we obtain on-par or favorable performance compared to state-of-the-art approaches on most datasets. The runtime-accuracy trade-off is superior on all datasets.

2 Related Work

Video Classification with Deep Learning. Most recent works on video classification are based on deep learning [6, 19, 29, 34, 48]. To explore the temporal context of a video, 3D convolutional networks are an obvious option. Tran et al. [33] introduced a 3D architecture with 3D kernels to learn spatio-temporal features from a sequence of frames. In a later work, they studied the use of a ResNet architecture with 3D convolutions and showed improvements over their earlier C3D architecture [34]. An alternative way to model the temporal relation between frames is by using recurrent networks [6, 24, 25]. Donahue et al. [6] employed an LSTM to integrate features from a CNN over time. However, the performance of recurrent networks on action recognition currently lags behind that of recent CNN-based methods, which may indicate that they do not sufficiently model long-term dynamics [24, 25]. Recently, several works utilized 3D architectures for action recognition [5, 35, 39, 48]. These approaches model the short-term temporal context of the input video based on a sliding window. At inference time, these methods must compute the average score over multiple windows, which is quite time consuming. For example, ARTNet [39] requires on average 250 samples to classify one video.

None of these approaches sufficiently uses the comprehensive information from the entire video during training and inference. Partial observation not only causes confusion in action prediction, but also requires an extra post-processing step to fuse scores. This extra feature/score aggregation reduces the speed of video processing and prevents the method from working in a real-time setting.

Long-Term Representation Learning. To cope with the problem of partial observation, some methods increased the temporal resolution of the sliding window [4, 36]. However, expanding the temporal length of the input has two major drawbacks: (1) it is computationally expensive, and (2) it still fails to cover the visual information of the entire video, especially for longer videos.

Some works proposed encoding methods [26, 28, 42] to learn a video representation from samples. In these approaches, features are usually calculated for each frame independently and are aggregated across time to make a video-level representation. This ignores the relationship between the frames.

To capture long-term information, recent works [2, 3, 7, 40] employed a sparse and global temporal sampling method to choose frames from the entire video during training. In the TSN model [40], as in the aggregation methods above, frames are processed independently at inference time and their scores are aggregated only in the end. Consequently, the performance in their experiments stays the same when they change the number of samples, which indicates that their model does not really benefit from the long-range temporal information.

Our work is different from these previous approaches in three main aspects: (1) Similar to TSN, we sample a fixed number of frames from the entire video to cover long-range temporal structure for understanding of video. In this way, the sampled frames span the entire video independent of the length of the video. (2) In contrast to TSN, we use a 3D-network to learn the relationship between the frames throughout the video. The network is trained end-to-end to learn this relationship. (3) The network directly provides video-level scores without post-hoc feature aggregation. Therefore, it can be run in online mode and in real-time even on small computing devices.

Video Captioning. Video captioning is a widely studied problem in computer vision [9, 14, 43, 45]. Most approaches use a CNN pre-trained on image classification or action recognition to generate features [9, 43, 45]. These methods, like the video understanding methods described above, utilize a frame-based feature aggregation (e.g. Resnet or TSN) or a sliding window over the whole video (e.g. C3D) to generate video-level features. The features are then passed to a recurrent neural network (e.g. LSTM) to generate the video captions via a learned language model. The extracted visual features should represent both the temporal structure of the video and the static semantics of the scene. However, most approaches suffer from the problem that the temporal context is not properly extracted. With the network model in this work, we address this problem, and can consequently improve video captioning results.

Real-Time and Online Video Understanding. Deep learning accelerated image classification, but video classification remains challenging in terms of speed. A few works dealt with real-time video understanding [18, 30, 31, 44]. EMV [44] introduced an approach for fast calculation of motion vectors. Despite this improvement, video processing is still slow. Kantorov [18] introduced a fast dense trajectory method. The other works used frame-based hand-crafted features for online action recognition [15, 22]. Both accuracy and speed of feature extraction in these methods are far from that of deep learning methods. Soomro et al. [31] proposed an online action localization approach. Their model utilizes an expensive segmentation method which, therefore, cannot work in real-time. More recently, Singh et al. [30] proposed an online detection approach based on frame-level detections at 40 fps. We compare to the last two approaches in Sect. 5.

Fig. 1. Architecture overview of ECO Lite. Each video is split into N subsections of equal size. From each subsection a single frame is randomly sampled. The samples are processed by a regular 2D convolutional network to yield a representation for each sampled frame. These representations are stacked and fed into a 3D convolutional network, which classifies the action, taking into account the temporal relationship.

3 Long-Term Spatio-Temporal Architecture

The network architecture is shown in Fig. 1. A whole video with a variable number of frames is provided as input to the network. The video is split into N subsections \(S_i\), \(i=1,...,N\) of equal size, and in each subsection, exactly one frame is sampled randomly. Each of these frames is processed by a single 2D convolutional network (weight sharing), which yields a feature representation encoding the frame’s appearance. By jointly processing frames from time segments that cover the whole video, we make sure that we capture the most relevant parts of an action over time and the relationship among these parts.

Randomly sampling the position of the frame is advantageous over always using the same position, because it leads to more diversity during training and makes the network adapt to variations in the instantiation of an action. Note that this kind of processing exploits all frames of the video during training to model the variation. At the same time, the network must only process N frames at runtime, which makes the approach very fast. We also considered more clever partitioning strategies that take the content of the subsections into account. However, this comes with the drawback that each frame of the video must be processed at runtime to obtain the partitioning, and the actual improvement by such smarter partitioning is limited, since most of the variation is already captured by the random sampling during training.
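The sampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and choosing the segment center when not training is an assumption made here for determinism (the text only specifies random sampling during training).

```python
import random


def segment_bounds(num_frames, n_segments):
    """Split num_frames into n_segments nearly equal [start, end) index ranges."""
    if num_frames < n_segments:
        raise ValueError("need at least one frame per segment")
    return [
        (i * num_frames // n_segments, (i + 1) * num_frames // n_segments)
        for i in range(n_segments)
    ]


def sample_frame_indices(num_frames, n_segments, train=True, rng=random):
    """Pick one frame index per segment.

    Training: a random position inside each segment, as described in the text.
    Otherwise: the segment center (an assumption of this sketch).
    """
    indices = []
    for start, end in segment_bounds(num_frames, n_segments):
        if train:
            indices.append(rng.randrange(start, end))
        else:
            indices.append((start + end - 1) // 2)
    return indices
```

Regardless of the video length, only `n_segments` indices are returned, so the number of frames passed to the network stays fixed at N while different training epochs see different frames of the same video.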

Fig. 2. (A) ECO Lite architecture as shown in more detail in Fig. 1. (B) Full ECO architecture with a parallel 2D and 3D stream.

Up to this point, the different frames in the video are processed independently. In order to learn how actions are made up of the different appearances of the scene over time, we stack the representations of all frames and feed them into a 3D convolutional network. This network yields the final action class label.

The architecture is very straightforward, and it is obvious that it can be trained efficiently end-to-end directly on the action class label and on large datasets. It is also an architecture that can be easily adapted to other video understanding tasks, as we show later in the video captioning Sect. 5.4.

3.1 ECO Lite and ECO Full

The 3D architecture in ECO Lite is optimized for learning relationships between the frames, but it tends to waste capacity in case of simple short-term actions that can be recognized just from the static image content. Therefore, we suggest an extension of the architecture by using a 2D network in parallel; see Fig. 2(B). For the simple actions, this 2D network architecture can simplify processing and ensure that the static image features receive the necessary importance, whereas the 3D network architecture takes care of the more complex actions that depend on the relationship between frames.

The 2D network receives feature maps of all samples and produces N feature representations. Afterwards, we apply average pooling to get a feature vector that is a representative for static scene semantics. We call the full architecture ECO and the simpler architecture in Fig. 2(A) ECO Lite.

3.2 Network Details

2D-Net: For the 2D network (\(\mathcal {H}_{\small {2D}}\)) that analyzes the single frames, we use the first part of the BN-Inception architecture (until inception-3c layer) [17]. Details are given in the supplemental material. It has 2D filters and pooling kernels with batch normalization. We chose this architecture due to its efficiency. The output of \(\mathcal {H}_{\small {2D}}\) for each single frame consists of 96 feature maps of size \( 28 \times 28\).

3D-Net: For the 3D network \(\mathcal {H}_{\small {3D}}\) we adopt several layers of 3D-Resnet18 [34], which is an efficient architecture used in many video classification works [34, 39]. Details on the architecture are provided in the supplemental material. The output of \(\mathcal {H}_{\small {3D}}\) is a one-hot vector for the different class labels.

2D-Net\(_{{\varvec{S}}}\): In the ECO Full design, we use 2D-Net\(_{{\varvec{S}}}\) in parallel with the 3D-Net to directly provide the static visual semantics of the video. For this network, we use the BN-Inception architecture from the inception-4a layer until the last pooling layer [17]. The last pooling layer produces a 1024-dimensional feature vector for each frame. We apply average pooling to generate a video-level feature and then concatenate it with the features obtained from the 3D-Net.
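The data flow between the three sub-networks can be traced with a small shape-bookkeeping sketch. It is illustrative only: the helper names are hypothetical, the channel-first layout (C, N, H, W) is an assumption, and the 512-dimensional 3D-Net feature used in the usage note is a placeholder value that the text does not specify.

```python
def stack_frame_features(frame_shapes):
    """Stack N per-frame (C, H, W) feature maps along a new temporal axis.

    Returns the shape (C, N, H, W) of the volume handed to the 3D network.
    """
    c, h, w = frame_shapes[0]
    if any(shape != (c, h, w) for shape in frame_shapes):
        raise ValueError("all frames must share one feature shape")
    return (c, len(frame_shapes), h, w)


def average_pool(vectors):
    """Average N equal-length per-frame vectors into one video-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def eco_full_feature(static_per_frame, temporal_feature):
    """Concatenate the pooled static (2D) vector with the temporal (3D) vector."""
    return average_pool(static_per_frame) + list(temporal_feature)
```

For N = 16 frames, `stack_frame_features([(96, 28, 28)] * 16)` gives `(96, 16, 28, 28)`. With sixteen 1024-dimensional static vectors and an assumed 512-dimensional temporal vector, `eco_full_feature` returns a vector of length 1536.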

3.3 Training Details

We train our networks using mini-batch SGD with Nesterov momentum and utilize dropout in each fully connected layer. We split each video into N segments and randomly select one frame from each segment. This sampling provides robustness to variations and enables the network to fully exploit all frames. In addition, we apply the data augmentation techniques introduced in [41]: we resize the input frames to 240 \(\times \) 320 and employ fixed-corner cropping and scale jittering with horizontal flipping (temporal jittering provided by sampling). Afterwards, we run per-pixel mean subtraction and resize the cropped regions to 224 \(\times \) 224.

The initial learning rate is 0.001 and decreases by a factor of 10 when validation error saturates for 4 epochs. We train the network with a momentum of 0.9, a weight decay of 0.0005, and mini-batches of size 32.
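The learning-rate rule above can be written as a small plateau schedule. This is a sketch under one stated assumption: "saturates" is interpreted as "no strict improvement of the validation error over the best value seen so far", which the text does not define precisely. The class name is hypothetical.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` stale epochs."""

    def __init__(self, lr=0.001, factor=10.0, patience=4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")   # lowest validation error seen so far
        self.stale_epochs = 0      # epochs since the last improvement

    def step(self, val_error):
        """Record one epoch's validation error; return the next learning rate."""
        if val_error < self.best:
            self.best = val_error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the measured validation error keeps the rate at 0.001 while the error improves and lowers it to 0.0001 after four consecutive epochs without improvement.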

We initialize the weights of the 2D-Net with the BN-Inception architecture [17] pre-trained on Kinetics, as provided by [41]. In the same way, we use the pre-trained model of 3D-Resnet18, as provided by [39], to initialize the weights of our 3D-Net. Afterwards, we train ECO and ECO Lite on the Kinetics dataset for 10 epochs.

For other datasets, we finetune the above ECO/ECO Lite models on the new datasets. Due to the complexity of the Something-Something dataset, we finetune the network for 25 epochs, reducing the learning rate every 10 epochs by a factor of 10. For the rest, we finetune for 4k iterations, and the learning rate drops by a factor of 10 as soon as the validation loss saturates. The whole training process on UCF101 and HMDB51 takes around 3 h on one Tesla P100 GPU for the ECO architecture. We adjusted the dropout rate and the number of iterations based on the dataset size.

Fig. 3. Scheme of our sampling strategy for online video understanding. Half of the frames are sampled uniformly from the working memory in the previous time step, the other half from the queue (Q) of incoming frames.

3.4 Test Time Inference

Most state-of-the-art methods run some post-processing on the network result. For instance, TSN and ARTNet [39, 41] collect 25 independent frames/volumes per video and, for each frame/volume, sample 10 regions by corner and center cropping together with horizontal flipping. The final prediction is obtained by averaging the scores of all 250 samples. This kind of inference at test time is computationally expensive and thus unsuitable for a real-time setting.

In contrast, our network produces action labels for the whole video directly without any additional aggregation. We sample N frames from the video, apply only center cropping then feed them directly to the network, which provides the prediction for the whole video with a single pass.
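The difference in test-time cost can be made concrete by counting network inputs per video. The sketch below only restates the arithmetic from the text (25 samples, four corner crops plus one center crop, each also flipped); the function names are hypothetical.

```python
def forward_inputs_multi_crop(num_samples=25, crops_per_sample=5, with_flips=True):
    """Count inputs for sample-and-crop score averaging.

    crops_per_sample=5 stands for four corner crops plus one center crop;
    horizontal flipping doubles the number of regions.
    """
    regions = crops_per_sample * (2 if with_flips else 1)
    return num_samples * regions


def forward_inputs_eco(n_frames):
    """ECO consumes N center-cropped frames in a single pass."""
    return n_frames
```

`forward_inputs_multi_crop()` evaluates to 25 × 10 = 250 inputs whose scores must be averaged, whereas `forward_inputs_eco(16)` is 16 frames processed jointly in one pass. For volume-based methods each of the 250 inputs is itself a multi-frame clip, so the gap in processed frames is larger still.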

4 Online Video Understanding

Most works on video understanding process in batch mode, i.e., they assume that the whole video is available when processing starts. However, in several application scenarios, the video will be provided as a stream and the current belief is supposed to be available at any time. Such online processing is possible with a sliding window approach, yet this comes with restrictions regarding the size of the window, i.e., long-term context is missing, or with a very long delay.

In this section, we show how ECO can be adapted to run very efficiently in online mode, too. The modification only affects the sampling part and keeps the network architecture unchanged. To this end, we partition the incoming video content into segments of N frames, where N is also the number of frames that go into the network. We use a working memory \(S_N\), which always comprises the N samples that are fed to the network together with a time stamp. When a video starts, i.e., only N frames are available, all N frames are sampled densely and are stored in the working memory \(S_N\). With each new time segment, N additional frames come in, and we replace half of the samples in \(S_N\) by samples from this time segment and update the prediction of the network; see Fig. 3. When we replace samples from \(S_N\), we uniformly replace samples from previous time segments. This ensures that changes can be anticipated in real time, while the temporal context is taken into account and slowly fades out via the working memory. Details on the update of \(S_N\) are shown in Algorithm 1.
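The working-memory update described above can be sketched as follows. This is an interpretation of the text and of Fig. 3, not a transcription of Algorithm 1: the class name is hypothetical, "uniformly" is read as uniform random sampling without replacement, and N is assumed to be even so that exactly half of the memory can be replaced.

```python
import random


class OnlineSampler:
    """Keeps a working memory of N (timestamp, frame) pairs for a live stream."""

    def __init__(self, n_samples, rng=None):
        if n_samples % 2:
            raise ValueError("n_samples must be even to replace exactly half")
        self.n = n_samples
        self.rng = rng or random.Random()
        self.queue = []    # frames of the current, still incomplete time segment
        self.memory = []   # working memory S_N, kept sorted by timestamp

    def push(self, timestamp, frame):
        """Add one frame.

        Returns the updated working memory when a segment of N frames is
        complete (the moment to run the network), otherwise None.
        """
        self.queue.append((timestamp, frame))
        if len(self.queue) < self.n:
            return None
        if not self.memory:
            # First segment: all N frames are stored densely.
            self.memory = list(self.queue)
        else:
            half = self.n // 2
            kept = self.rng.sample(self.memory, half)    # older context
            fresh = self.rng.sample(self.queue, half)    # newest segment
            self.memory = sorted(kept + fresh, key=lambda item: item[0])
        self.queue = []
        return list(self.memory)
```

Because half of the memory is resampled from itself at every update, the share of frames from any given past segment shrinks by roughly a factor of two per step, which is one way to realize the slow fade-out of older context while memory use stays fixed at N frames.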

The online approach with ECO runs at 675 fps (and at 970 fps with ECO Lite) on a Tesla P100 GPU. In addition, the model is memory efficient by just keeping exactly N frames. This enables the implementation also on much smaller hardware, such as mobile devices. The video in the supplemental material shows the recorded performance of the online version of ECO in real-time.

Algorithm 1. Update of the working memory \(S_N\) in online mode.

5 Experiments

We evaluate our approach on different video understanding problems to show its generalization ability. We evaluated the network architecture on the most common action classification datasets in order to compare its performance against the state-of-the-art approaches. This includes the older but still very popular datasets UCF101 [32] and HMDB51 [21], but also the more recent datasets Kinetics [20] and Something-Something [10]. Moreover, we applied the architecture to video captioning and tested it on the widely used Youtube2text dataset [11]. For all of these datasets, we use the standard evaluation protocol provided by the authors.

The comparison is restricted to approaches that take the raw RGB videos as input without further pre-processing, for instance, by providing optical flow or human pose. The term ECO\(_{NF}\) describes a network that gets N sampled frames as input. The term ECO\(_{En}\) refers to average scores obtained from an ensemble of networks with {16, 20, 24, 32} number of frames.
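The ensemble variant amounts to averaging the class scores of several networks. A minimal sketch, with hypothetical function names and made-up scores in the usage note:

```python
def ensemble_scores(per_model_scores):
    """Average class-score vectors from several networks, element by element."""
    n_models = len(per_model_scores)
    return [sum(scores) / n_models for scores in zip(*per_model_scores)]


def predict(per_model_scores):
    """Return the index of the class with the highest averaged score."""
    averaged = ensemble_scores(per_model_scores)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

Given four illustrative two-class score vectors, one per network with 16, 20, 24, and 32 input frames, `predict([[0.6, 0.4], [0.3, 0.7], [0.4, 0.6], [0.5, 0.5]])` selects class 1, because the second averaged score is the larger one.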

Table 1. Comparison to the state-of-the-art on UCF101 and HMDB51 datasets (over all three splits), using just RGB modality.


Table 2. Comparing performance of ECO with state-of-the-art methods on the Kinetics dataset.
Table 3. Comparison with the state of the art on the Something-Something dataset. Last row shows the results using both Flow and RGB.

5.1 Benchmark Comparison on Action Classification

Tables 1, 2, and 3 show the results obtained with ECO on the different datasets and compare them to the state of the art. For UCF-101, HMDB-51, and Kinetics, ECO outperforms all existing methods except I3D, which uses a much heavier network. On Something-Something, it outperforms the other methods by a large margin. This shows the strong performance of the comparatively simple and small ECO architecture.

Table 4. Runtime comparison with state-of-the-art approaches using a Tesla P100 GPU on the UCF101 and HMDB51 datasets (over all splits). For other approaches, we consider just one crop per sample to calculate the runtime. The reported runtime does not include I/O.
Table 5. Accuracy and runtime of ECO and ECO Lite for different numbers of sampled frames. The reported runtime is without considering I/O.

5.2 Accuracy-Runtime Comparison

The advantages of the ECO architectures become even more prominent when we look at the accuracy-runtime trade-off shown in Table 4 and Fig. 4. The ECO architectures yield the same accuracy as other approaches at much faster rates.

Fig. 4. Accuracy-runtime trade-off on UCF101 for various versions of ECO and other state-of-the-art approaches. ECO is much closer to the top right corner.

Fig. 5. Examples from the Something-Something dataset. In this dataset, the temporal context plays an even bigger role than on other datasets, since the same action is done with different objects, i.e., the appearance of the object or background gives almost no cues about the action.

Previous works typically measure the speed of an approach in frames per second (fps). Our model with ECO runs at 675 fps (and at 970 fps with ECO Lite) on a Tesla P100 GPU. However, this does not reflect the time needed to process a whole video. This becomes relevant for methods like TSN and ours, which do not look at every frame of the video, and motivates us to report videos per second (vps) to compare the speed of video understanding methods.
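Why fps and vps can diverge is easiest to see with an idealized conversion. The sketch below assumes that the per-video time is dominated by the frames actually pushed through the network and ignores I/O and any fixed overhead; it is not the measurement protocol behind the reported numbers, the function name is hypothetical, and the values in the usage note are made up.

```python
def videos_per_second(frames_per_second, frames_processed_per_video):
    """Idealized video throughput when only the listed frames are processed."""
    if frames_processed_per_video <= 0:
        raise ValueError("at least one frame must be processed per video")
    return frames_per_second / frames_processed_per_video
```

At a hypothetical network speed of 600 fps, a method that must process all 300 frames of a clip reaches `videos_per_second(600, 300)` = 2.0 vps, whereas a method that samples only 16 frames reaches `videos_per_second(600, 16)` = 37.5 vps. Two methods with the same fps can therefore differ by more than an order of magnitude in vps.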

ECO can process videos at least an order of magnitude faster than the other approaches, making it an excellent architecture for video retrieval applications.

Number of Sampled Frames. Table 5 compares the two architecture variants ECO and ECO Lite and evaluates the influence of the number of sampled frames N. As expected, the accuracy drops when sampling fewer frames, as the subsections get longer and important parts of the action can be missed. This is especially true for fast actions, such as “throw discus”. However, even with just 4 samples the accuracy of ECO is still much better than that of most approaches in the literature, since ECO takes into account the relationship between these 4 instants in the video, even if they are far apart. Figure 6 even shows that for simple short-term actions, the performance decreases when using more samples. This is surprising at first glance, but could be explained by the better use of the network’s capacity for simple actions when fewer channels are fed to the 3D network.

Fig. 6. Effect of the complexity of an action on the need for denser sampling. While simple short-term actions (leftmost group) even suffer from more samples, complex actions (rightmost group) clearly benefit from a denser sampling.

ECO vs. ECO Lite. The full ECO architecture yields slightly better results than the plain ECO Lite architecture, but is also a little slower. The differences in accuracy and runtime between the two architectures can usually be compensated by using more or fewer samples. On the Something-Something dataset, where the temporal context plays a much bigger role than on other datasets (see Fig. 5), ECO Lite performs as well as the full ECO architecture even with the same number of input samples, since the raw processing of single image cues has little relevance on this dataset.

Fig. 7. Early action classification results of ECO in comparison to existing online methods [30, 31] on the J-HMDB dataset. The online version of ECO yields a high accuracy already after seeing a short part of the video. Singh et al. [30] use both RGB and optical flow.

Table 6. Captioning results on Youtube2Text (MSVD) dataset.

5.3 Early Action Recognition in Online Mode

Figure 7 evaluates our approach in online mode and shows how many frames the method needs to achieve its full accuracy. We ran this experiment on the J-HMDB dataset due to the availability of results from other online methods on this dataset. Compared to these existing methods, ECO reaches a good accuracy faster and also saturates at a higher absolute accuracy.

5.4 Video Captioning

To show the wide applicability of the ECO architecture, we also combine it with a video captioning network. To this end, we use ECO pre-trained on Kinetics to analyze the video content and train the state-of-the-art Semantic Compositional Network [9] for captioning. We evaluated on the Youtube2Text (MSVD) dataset [11], which consists of 1,970 video clips with an average duration of 9 s and covers various types of videos, such as sports, landscapes, animals, cooking, and human activities. The dataset contains 80,839 sentences and each video is annotated with around 40 sentences.

Table 6 shows that ECO compares favorably to previous approaches across all popular evaluation metrics (BLEU [27], METEOR [23], CIDEr [37]). Even ECO Lite is already on-par with a ResNet architecture pre-trained on ImageNet. Concatenating the features from ECO with those of ResNet improves results further. Qualitative examples that correspond to the improved numbers are shown in Table 7.

Table 7. Qualitative results on MSVD. First row corresponds to the examples where ECO improved over SCN and the second row shows the examples where ECO decreased the quality compared to SCN. ECO\(_L\) refers to ECO\(_{Lite-16F}\), ECO to ECO\(_{32F}\), and ECO\(_R\) to ECO\(_{32F+resnet}\).

6 Conclusions

In this paper, we have presented a simple and very efficient network architecture that looks only at a small subset of frames from a video and learns to exploit the temporal context between these frames. This principle can be used in various video understanding tasks. We demonstrate excellent results on action classification, online action classification, and video captioning. The low computational load and memory footprint make an implementation on mobile devices a viable future option. The approach runs 10\(\times \) to 80\(\times \) faster than state-of-the-art methods.