Recently, the generation, storage and sharing of multimedia video data have increased at an astronomical rate. In 2012, more than 100 hours of video were uploaded to YouTube every minute. The multimedia data being shared covers a wide variety of content, ranging from homemade birthday videos to professionally produced comedy skits, and from woodworking tutorials to breaking news reports and analysis. Although the storage and dissemination capacity of the network has grown exponentially, the development of automatic tools to search and retrieve this data has not kept pace, and, by and large, manual annotation and categorization is used for video search. Besides being expensive and slow, manual annotation cannot capture the rich content of video data. A layperson or an analyst might want to search videos not only by the main topic (e.g., a news report of a protest) but also by the events taking place, the activities of the people and entities in view, the conversations held and the sounds recorded. Automatically detecting, classifying and indexing everything that can potentially be recorded is clearly a grand challenge, and it needs to be divided into manageable sub-problems for the research and development of search solutions.

One of the sub-challenges for video search is human activity and event detection. Recently, some success has been reported in large-scale object recognition, object tracking and human action detection using computer vision, as well as in the automatic indexing of speech in well-defined scenarios using audio processing. A far bigger challenge, however, is to generalize such findings to fully unconstrained settings and to significantly increase both the types of searchable events and the accuracy of the retrieved results. To this end, it is widely believed that exploiting multiple modalities (i.e., imagery, audio and text), as opposed to a single modality (imagery only), and accurately fusing the information obtained from each modality is likely to help event detection. The aim of this special issue is to promote the important topic of multimedia event detection. The papers published in this issue range from system papers focused on multi-modal frameworks, through the fusion of audio–visual cues, to single-modality event and behavior understanding, all of which are critical topics for pushing the current frontiers of multimedia event detection.

In the first paper, Tong et al. [1] present a method for learning a discriminative video representation that integrates the audio, visual and text (from OCR and speech recognition) modalities for event recognition. This intermediate representation is a compact vector derived from multiple bag-of-words features and is learned by minimizing a robust loss function. The framework can also use auxiliary information, i.e., videos not related to the targeted events, for learning the intermediate representation. Experiments indicate that the multi-modal intermediate representation yields better results than early or late fusion schemes for integrating the available modalities.

Myers et al. [2] present a comprehensive evaluation of features and multi-modal fusion schemes for event detection in videos. A thorough analysis of the usefulness of individual modalities and features for event detection is presented. It is also shown that, in large-scale experiments, a number of complex late fusion schemes such as sparse mixture models, MAP weighting and multi-stage SVMs do not outperform the simple arithmetic mean of the per-modality scores.
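For readers unfamiliar with late fusion, the arithmetic-mean baseline simply averages the per-modality detection scores of each video. The sketch below illustrates this under the assumption that each modality's scores have already been calibrated to a comparable range; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def late_fusion_mean(score_matrix):
    """Arithmetic-mean late fusion.

    score_matrix: array of shape (n_videos, n_modalities); each column holds
    the (already calibrated) event-detection scores of one modality or feature.
    Returns one fused score per video.
    """
    return np.asarray(score_matrix).mean(axis=1)

# Example: scores from three modalities (visual, audio, text) for four videos.
scores = np.array([[0.9, 0.7, 0.4],
                   [0.2, 0.1, 0.3],
                   [0.6, 0.8, 0.9],
                   [0.1, 0.5, 0.2]])
print(late_fusion_mean(scores))  # one fused score per video
```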

Jhuo et al. [3] present a new joint representation for audio–visual patterns by creating bi-modal words that represent the audio and visual signals jointly. This is done by first extracting visual and audio words separately and then building a bipartite graph to create the bi-modal word codebook. Multiple codebooks are produced, and the visual and audio words are discretized again using different pooling techniques to produce the bi-modal words. The multiple codebooks give rise to multiple bi-modal representations, which are combined using multiple kernel learning for event classification. Experiments show significant improvements over methods that use the individual visual and audio codewords.
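As a rough illustration of the idea (not the authors' exact construction), the two vocabularies can be linked through a visual-word/audio-word co-occurrence matrix and partitioned jointly. The snippet below uses scikit-learn's spectral co-clustering as a stand-in for the bipartite-graph partitioning step; the inputs and names are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def bimodal_codebook(visual_hist, audio_hist, n_bimodal_words=64, seed=0):
    """Rough sketch of bi-modal word construction.

    visual_hist: (n_clips, n_visual_words) bag-of-visual-words counts per clip.
    audio_hist:  (n_clips, n_audio_words)  bag-of-audio-words counts per clip.
    Returns a bi-modal word index for every visual word and every audio word.
    """
    # Co-occurrence of visual and audio words across clips; this plays the role
    # of the edge weights of the bipartite graph between the two vocabularies.
    cooc = visual_hist.T @ audio_hist + 1e-6   # offset avoids all-zero rows/columns
    model = SpectralCoclustering(n_clusters=n_bimodal_words, random_state=seed)
    model.fit(cooc)
    # Words falling into the same co-cluster share one bi-modal word.
    return model.row_labels_, model.column_labels_
```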

Oh et al. [4] describe a system for multimedia event detection that extracts video, audio and image features at different granularities and fuses them with a novel technique. A hierarchy from low-level features to high-level concepts is used. First, base features such as SIFT, HoG and MFCC are extracted. Mid-level features are then created from these using classifiers such as support vector machines; they may correspond to objects such as "tree" or "computer". A latent SVM, trained in an unsupervised manner and able to model unobserved variables during training, is then used to infer high-level concepts such as "skateboarding in a garage" from the mid-level features. Two new score-fusion functions are also described. Experiments are reported on the 2011 TRECVID MED dataset.
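The low-to-mid-level step of such a hierarchy can be pictured as training one classifier per mid-level concept on the pooled base features and stacking their scores into a new representation. The sketch below substitutes a plain logistic regression for the latent SVM at the high level, so it only illustrates the shape of the pipeline; the data layouts and names are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def train_feature_hierarchy(X_low, concept_labels, event_labels):
    """Illustrative two-level pipeline: base features -> concept scores -> event.

    X_low:          (n_clips, d) pooled base features (e.g. SIFT/HoG/MFCC bags).
    concept_labels: (n_clips, n_concepts) binary mid-level concept annotations.
    event_labels:   (n_clips,) event label per clip.
    """
    # One linear SVM per mid-level concept ("tree", "computer", ...).
    concept_models = [LinearSVC().fit(X_low, concept_labels[:, c])
                      for c in range(concept_labels.shape[1])]
    # Mid-level representation: one decision score per concept and clip.
    X_mid = np.column_stack([m.decision_function(X_low) for m in concept_models])
    # High-level event classifier (a stand-in for the paper's latent SVM).
    event_model = LogisticRegression(max_iter=1000).fit(X_mid, event_labels)
    return concept_models, event_model
```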

Marin-Jimenez et al. [5] introduce another interesting approach to fusing audio and visual information for human interaction recognition in unconstrained videos. The authors adopt bag-of-words representations for both the audio and visual channels and evaluate several fusion methods. Several observations are discussed, such as the superior performance of audio features over visual, image-based ones, which is notable since prior work on human interaction recognition has typically relied on visual features alone.

Burghouts et al. [6] describe an approach that combines several interesting ideas for action detection, including negative training sample selection, a two-stage classification pipeline and the fusion of multiple features. Each of these ideas leads to considerable performance gains, and their combination achieves state-of-the-art results on popular benchmark datasets such as IXMAS.

Spampinato et al. [7] investigate the interesting problem of fish behavior analysis with underwater cameras, which is helpful for marine biologists studying the underwater environment and climate conditions. A comprehensive system integrating many techniques is developed to detect and track fish. Two important behaviors of several common fish species are then recognized, and promising results are reported.

John et al. [8] investigate the effect of using a single subspace versus multiple subspaces for human activity classification. The subspaces are learned using a charting method, a novel approach in this domain. Skeleton data is used as input to the classification system, where a layered pruning scheme reduces the complexity of the classification problem. The framework is evaluated on public benchmark datasets, and state-of-the-art results are obtained.

Lee et al. [9] describe a system for detecting pedestrian abnormalities in outdoor video data. Reliable pedestrian trajectories are obtained by a multi-view detection and tracking approach. These trajectories are then analyzed, and abnormal events, e.g., illegal entry and line forming, are detected in real time. More sophisticated, non-real-time event validators are built on top of these real-time detectors. The system is evaluated in different outdoor settings.
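A simple example of a real-time rule of the kind described is flagging a trajectory as an "illegal entry" as soon as it enters a forbidden region. The region, trajectory format and example values below are hypothetical and only meant to illustrate the idea.

```python
def entered_forbidden_zone(trajectory, zone):
    """Flag a pedestrian trajectory that enters an axis-aligned forbidden zone.

    trajectory: list of (x, y) ground-plane positions over time.
    zone:       (x_min, y_min, x_max, y_max) rectangle marking the off-limits area.
    """
    x_min, y_min, x_max, y_max = zone
    return any(x_min <= x <= x_max and y_min <= y <= y_max for x, y in trajectory)

# Example: a track that crosses the restricted rectangle triggers the alarm.
track = [(0.0, 0.0), (1.0, 0.5), (2.2, 1.1), (3.0, 1.4)]
print(entered_forbidden_zone(track, zone=(2.0, 1.0, 4.0, 3.0)))  # True
```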

In the last paper of this special issue, Zhu et al. [10] address the problem of summarizing a huge amount of video data into a much shorter sequence. The approach builds on the notion of video synopsis, where spatio-temporal events are detected and stitched together into a single, highly compressed video. The focus of the work is on avoiding overlapping events and hence obtaining a less redundant video synopsis. The approach is evaluated on various real video sequences.
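The core scheduling problem behind video synopsis can be illustrated with a toy greedy scheme: each detected event "tube" is shifted to the start time on the shorter timeline that currently causes the least overlap with the tubes already placed. This is only a conceptual sketch, not the optimization used in the paper.

```python
def schedule_tubes(durations, synopsis_length):
    """Greedy placement of event tubes (durations in frames) on a shorter
    synopsis timeline, minimizing overlap with the tubes placed so far."""
    assert max(durations) <= synopsis_length
    occupancy = [0] * synopsis_length             # tubes covering each frame
    starts = [0] * len(durations)
    for i in sorted(range(len(durations)), key=lambda k: -durations[k]):
        d = durations[i]                          # place longer tubes first
        best_start, best_cost = 0, float("inf")
        for s in range(synopsis_length - d + 1):
            cost = sum(occupancy[s:s + d])        # overlap if tube i starts at s
            if cost < best_cost:
                best_start, best_cost = s, cost
        for t in range(best_start, best_start + d):
            occupancy[t] += 1
        starts[i] = best_start
    return starts

# Example: five events of varying length compressed into a 100-frame synopsis.
print(schedule_tubes([40, 25, 25, 10, 60], synopsis_length=100))
```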

The guest editors would like to thank the authors for their efforts and interest. We also greatly appreciate the valuable comments and suggestions provided by the reviewers. Finally, special thanks go to the Editor-in-Chief, Prof. Mubarak Shah, and to the Editorial Coordinator, Cherry Place, for their support during the preparation and publication of this special issue.