The goal of multimedia event detection (MED) is to detect user-defined events of interest in massive, continuously growing video collections, such as those found on the Internet. This is an extremely challenging problem because the contents of the videos in these collections are completely unconstrained. The collections include user-generated videos of varying quality, often made with handheld cameras, which may exhibit jerky motion, wildly varying fields of view, and poor lighting. The audio in these videos is recorded in a variety of acoustic environments, often with a single camera-mounted microphone, with no attempt to prevent background sounds from masking speech.

For purposes of this research, an event, as defined in the TRECVID MED evaluation task sponsored by the National Institute of Standards and Technology (NIST) [1], has the following characteristics:

  • It includes a complex activity occurring at a specific place and time.

  • It involves people interacting with other people and/or objects.

  • It consists of a number of human actions, processes, and activities that are loosely or tightly organized and have significant temporal and semantic relationships to the overarching activity.

  • It is directly observable.

Figure 1 shows some sample video imagery from events in the TRECVID MED evaluation task. Events are complex and may include actions (hammering, pouring liquid) and activities (dancing) occurring in different scenes (street, kitchen). Some events may be process-oriented, with an expected sequence of stages, actions, or activities (making a sandwich or repairing an appliance); other events may be a set of ongoing activities with no particular beginning or end (birthday party or parade). An event may be observed in only a portion of the video clip, and relevant clips may contain extraneous content.

Fig. 1 Key-frame series from example videos for the events making a sandwich, repairing an appliance, birthday party, and parade

Multimedia event detection can be considered as a search problem with a query-retrieval paradigm. Currently, videos in online collections, such as YouTube, are retrieved based on text-based search. Text labels are either manually assigned when the video is added to the collection or derived from text already associated with the video, such as text content that occurs near the video in a multimedia blog or web page. Videos are searched and retrieved by matching a text-based user query to videos’ text labels, but performance will depend on the quality and availability of such labels.

Highly accurate text-based video retrieval requires the text-based queries to be comprehensive and specific. In the TRECVID MED evaluation, each event is defined by an “event kit,” which includes a 150–400 word text description consisting of an event name, definition, explication (textual exposition of the terms and concepts), and lists of scenes, objects, people, activities, and sounds that would indicate the presence of the event. Figure 2 shows an example for the event working on a woodworking project. The user might also have to specify how similar events are distinguished from the event of interest (e.g., not construction in Fig. 2), and may have to estimate the frequency with which various entities occur in the event (e.g., often indoors). Subcategories and variations of the event may also have to be considered (e.g., operating a lathe in a factory).

Fig. 2 Event kit for working on a woodworking project

Another approach to detect events is to define the event in terms of a set of example videos, which we call an example-based approach. Example videos are matched to videos in the collection using the same internal representation for each. In this approach, the system automatically learns a model of the event based on a set of positive and negative examples, taking advantage of well-established capabilities in machine learning and computer vision. This paper considers an example-based approach with both non-semantic and semantic representations.

Current approaches for MED [2–7] rely heavily on kernel-based classifier methods that use low-level features computed directly from the multimedia data. These classifiers learn a mapping between the computed features and the category of event that occurs in the video. Videos and events are typically represented as “bag-of-words” (BOW) models composed of histograms of descriptors for each feature type, including visual, motion, and audio features. Although these models perform well, individual low-level features do not correspond directly to terms with semantic meaning, and therefore cannot provide human-understandable evidence of why a video was selected by the MED system as a positive instance of a specific event.

A second representation is in terms of higher-level semantic concepts, which are automatically detected in the video content [8–11]. The detectors are related to objects, like a flag; scenes, like a beach; people, like female; and actions, like dancing. The presence of concepts such as these creates an understanding of the content. However, except for a few entities such as faces, most individual concept detectors are not yet reliable [12]. In addition, training detectors for each concept requires annotated data, which usually involves significant manual effort to generate. In the future, it is expected that more annotated datasets will be available, and weakly supervised learning methods will help improve the efficiency of generating them. Event representations based on high-level concepts have started to appear in the literature [13–16].

For an example-based approach, the central research issue is to find an event representation in terms of the elements of the video that permits the accurate detection of the events. In our approach, an event is modeled as a set of multiple bags-of-words, each based on a single data type. Partitioning the representation by data type permits the descriptors for each data type to be optimized independently (specific multimodal combinations of features, such as bimodal audiovisual features [3], can be considered a single data type within this architecture). The data types we used included both low-level features (visual appearance, motion, and audio) and higher-level semantic concepts (visual concepts). We also used automatic speech recognition (ASR) to generate a BOW model in which semantic concepts were expressed directly by words in the recognized speech. The resulting event model combined multiple sources of information from multiple data types and multiple levels of information.

As part of the optimization process for the low-level features, we investigated the use of difference coding techniques in addition to conventional coding methods. Because the information captured by difference coding is somewhat complementary to the information produced by the traditional BOW, we anticipated an improvement in performance. We conducted experiments to compare the performance of difference coding techniques with conventional feature coding techniques.

The remaining challenge is finding the best method for combining the multiple bags-of-words in the event-detection decision process. The most common approach is to apply late fusion methods [3, 5, 17] in which the results for each data type are combined by fusing the decision scores from multiple event classifiers. This is a straightforward way of using the information from all data types in proportion to their relative contribution to event detection on videos with widely diverse content. We evaluated the performance of several fusion methods.

The work described in this paper focused on evaluating the various data types and fusion methods for MED. Our approach for example-based MED, including methods for content extraction and fusion, is described in Sect. 2. Experimental results are described in Sect. 3, and Sect. 4 contains a summary and discussion.

All the experiments for evaluating the performance of the MED capability were performed using the data provided in the TRECVID MED evaluation task. The MED evaluation uses the Heterogeneous Audio Visual Internet Collection (HAVIC) video data collection [18], which is a large corpus of Internet multimedia files collected by the Linguistic Data Consortium.

Approach for example-based MED

The work in this paper focuses on SEarch with Speed and Accuracy for Multimedia Events (SESAME), an MED system in which an event is specified as a set of video clip examples. A supervised learning process trains an event model from positive and negative examples, and an event classifier uses the event model to detect the targeted event. An event classifier was built for each data type. The results of all the event classifiers were then combined by fusing their decision scores. An overview of the SESAME system and methods for event classification and fusion are described in the following sections.

SESAME system overview

The major components of the SESAME system are shown in Fig. 3. A total of nine event classifiers generate event detection decision scores: two based on low-level visual features, three based on low-level motion features, one based on low-level audio features, two based on visual concepts, and one based on ASR. The outputs of the event classifiers are combined by the fusion process.

Fig. 3 Major components of the SESAME system

Figure 4 shows the processing blocks within each event classifier. Each event classifier operates on a single type of data and includes both training and event classification. Content is extracted from positive and negative video examples, and the event classifier is trained, resulting in an event model. The event model produces event detection scores when it is applied to a test set of videos. Figure 4 does not show off-line training and testing to optimize the parameter settings for the content extraction processes.

Fig. 4 Example-based event classifier for MED

Content extraction methods

This section describes the feature coding and aggregation methods that were common to the low-level features and the content extraction methods for the different data types: low-level visual features, low-level motion features, low-level audio features, high-level visual features, and ASR.

Feature coding and aggregation

The coding and aggregation of low-level features share common elements that we describe here. We extracted local features and aggregated them by using three approaches: conventional BOW, vector of locally aggregated descriptors (VLAD), and Fisher vectors (FV).

The conventional BOW approach partitions low-level features into \(k\)-means clusters to generate a codebook. Given a set of features from a video, a histogram is generated by assigning each feature from the set to one or several nearest code words. Several modifications to this approach are possible. One variation uses soft coding, where instead of assigning each feature to a single code word, distances from the code words are used to weigh the histogram terms for the code words. Another variation describes code words by a Gaussian mixture model (GMM), rather than just by the center of a cluster.
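The hard-assignment variant of this procedure can be illustrated with a minimal sketch. It assumes a codebook that has already been learned (the two toy code words below are hypothetical), and the L1 normalization of the histogram is our assumption rather than a detail stated above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(features, codebook):
    """Hard-assignment BOW: count the nearest code word of each feature."""
    counts = [0] * len(codebook)
    for f in features:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(f, codebook[k]))
        counts[nearest] += 1
    total = sum(counts) or 1  # guard against an empty feature set
    # Assumed L1 normalization, so clips with different feature counts compare.
    return [c / total for c in counts]


# Toy data: two code words (cluster centers) and four local features.
codebook = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [11.0, 10.0]]
hist = bow_histogram(features, codebook)
```

In this toy case two features fall nearest to each code word, so the histogram is evenly split between the two bins.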

While conventional BOW aggregation has been successfully used for many applications, it does not maintain any information about the distribution of features in the feature space. FV has been introduced in previous work [19] to capture more detailed statistics, and has been applied to image classification and retrieval [20, 21]. The basic idea is to represent a set of data by a gradient of its log-likelihood to model parameters and to measure the distance between instances with the Fisher kernel. For local features extracted from videos, it becomes natural to model their distribution as GMMs, forming a soft codebook. With GMM, the dimension of FV is linear in the number of mixtures and local feature dimensions.

Finally, VLAD [20] is proposed as a non-probabilistic version of FV. It uses \(k\)-means instead of GMM, and accumulates the relative positions of feature points to their single nearest neighbors in the codebook.
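A minimal sketch of the VLAD accumulation step follows, assuming the \(k\)-means centers are given. Any subsequent normalization of the descriptor is omitted, and the function name and toy data are hypothetical.

```python
def vlad(features, centers):
    """Sum, per center, the residuals (feature - center) of its nearest features."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for f in features:
        # Index of the single nearest center (hard assignment).
        k = min(range(len(centers)),
                key=lambda i: sum((x - c) ** 2 for x, c in zip(f, centers[i])))
        for j in range(dim):
            residuals[k][j] += f[j] - centers[k][j]
    # Concatenate the per-center residual sums into one descriptor.
    return [v for row in residuals for v in row]


centers = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [11.0, 10.0]]
descriptor = vlad(features, centers)
```

The descriptor has one block of length equal to the feature dimension per center. Features that are symmetrically distributed around a center cancel out in that block, which is what distinguishes the encoding from a count histogram.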

Compared with conventional BOW, FV and VLAD have the following benefits:

  • FV takes GMM as the underlying generative model.

  • Both FV and VLAD are derivatives, so feature points with the same distribution as the general model have no overall impact on the video-level descriptors; as a result, FV and VLAD can suppress noisy and redundant signals.

None of the above aggregation methods consider feature localization in space or in time. We introduced a limited amount of this information by dividing the video into temporal segments (for time localization) and spatial pyramids (for spatial localization). We then compute the features in each segment or block separately and concatenate the resulting features. The spatial pooling and temporal segmentation parameters that yielded the best performance were determined through experimentation.
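The temporal part of this scheme can be sketched as follows, assuming each local feature has already been mapped to a code word index and carries a timestamp. The function name, argument names, and the equal-length segmentation are illustrative assumptions; spatial pyramid blocks can be handled analogously by binning on image position instead of time.

```python
def segmented_bow(word_ids, times, duration, n_words, n_segments):
    """Concatenate one BOW count histogram per equal-length temporal segment."""
    hist = [0] * (n_words * n_segments)
    for w, t in zip(word_ids, times):
        # Segment index of this feature; the last instant maps to the last segment.
        seg = min(int(t / duration * n_segments), n_segments - 1)
        hist[seg * n_words + w] += 1
    return hist


# Four features in a 10 s clip, three code words, two temporal segments.
segmented = segmented_bow(word_ids=[0, 1, 1, 2],
                          times=[0.5, 1.0, 6.0, 9.9],
                          duration=10.0, n_words=3, n_segments=2)
```

The descriptor length grows linearly with the number of segments, which is one reason the segmentation parameters have to be tuned experimentally.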

Visual features

Two event classifiers were developed based on low-level visual features [22]. They both follow a pipeline consisting of four stages: spatiotemporal sampling of points of interest, visual description of those points, encoding the descriptors into visual words, and supervised learning with kernel machines.

Spatiotemporal sampling The visual appearance of an event in video may have a dependency on the spatiotemporal viewpoint under which it is recorded. Salient point methods [23] introduce robustness against viewpoint changes by selecting points, which can be recovered under different perspectives. To determine salient points, Harris–Laplace relies on a Harris corner detector; applying it on multiple scales makes it possible to select the characteristic scale of a local corner using the Laplacian operator. For each corner, the Harris–Laplace detector selects a scale invariant point if the local image structure under a Laplacian operator has a stable maximum.

Another solution is to use many points by dense sampling. For imagery with many homogeneous areas, such as outdoor snow scenes, corners may be rare, so relying on a Harris–Laplace detector can be suboptimal. To counter the shortcomings of Harris–Laplace, we used dense sampling, which samples an image grid in a uniform fashion, using a fixed pixel interval between regions.

In our experiments, we used an interval distance of six pixels and sampled at multiple scales. Appearance variations caused by temporal effects were addressed by analyzing video beyond the key-frame level [24]. Taking more frames into account during analysis allowed us to recognize events that were visible during the video, but not necessarily in a single key frame. We sampled one frame every 2 s. Both Harris–Laplace and dense sampling give an equal weight to all keypoints, regardless of their spatial location in the image frame. To overcome this limitation, Lazebnik et al. [25] suggested repeated sampling of fixed subregions of an image, e.g., 1 \(\times \) 1, 2 \(\times \) 2, 4 \(\times \) 4, etc., and then aggregating the different resolutions into a spatial pyramid, which allows for region-specific weighting. Since every region is an image in itself, the spatial pyramid can be combined with both the Harris–Laplace point detector and dense point sampling. We used a spatial pyramid of 1 \(\times \) 1 and 1 \(\times \) 3 regions in our experiments.

Visual descriptors In addition to the visual appearance of events in the spatiotemporal viewpoint under which they are recorded, the lighting conditions during recording also play an important role in MED. Properties of color features under classes of illumination and viewing features, such as viewpoint, light intensity, light direction, and light color, can change, specifically for real-world datasets as considered within TRECVID [26]. We followed [22] and used a mixture of SIFT, OpponentSIFT, and C-SIFT descriptors. The SIFT feature proposed by Lowe [27] describes the local shape of a region using edge-orientation histograms. Because the SIFT feature is normalized, the gradient magnitude changes have no effect on the final feature. OpponentSIFT describes all the channels in the opponent color space using SIFT features. The information in the O3 channel is equal to the intensity information, while the other channels describe the color information in the image. The feature normalization, as effective in SIFT, cancels out any local changes in light intensity. In the opponent color space, the O1 and O2 channels still contain some intensity information. To add invariance to shadow and shading effects, the C-invariant [28] eliminates the remaining intensity information from these channels. The C-SIFT feature uses the C-invariant, which can be seen as the gradient (or derivative) for the normalized opponent color space O1/I and O2/I. The I intensity channel remains unchanged. C-SIFT is known to be scale-invariant with respect to light intensity. We computed the SIFT and C-SIFT descriptors around salient points obtained from the Harris–Laplace detector and dense sampling. We then reduced all descriptors to 80 dimensions with principal component analysis (PCA).

Word encoding To avoid using all low-level visual features from a video, we followed the well-known codebook approach. We first assigned the features to discrete codewords from a predefined codebook. Then, we used the frequency distribution of the codewords as a compact feature vector representing an image frame. Based on [22], we employed codebook construction using \(k\)-means clustering in combination with average codeword assignment and a maximum of 4,096 codewords. The traditional hard assignment can be improved using soft assignment through kernel codebooks [29]. A kernel codebook uses a kernel function to smooth the hard assignment of (image) features to codewords by assigning descriptors to multiple clusters weighted by their distance to the center. We also used difference coding, with VLAD performing \(k\)-means clustering of the PCA-reduced descriptor space with 1,024 components. The output of the word encoding is a BOW vector using either hard average coding or soft VLAD coding. The BOW vector forms the foundation for event detection.
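The soft assignment performed by a kernel codebook can be sketched as below. This is a minimal illustration that assumes a Gaussian kernel with a hypothetical bandwidth `sigma`; the kernel and parameters actually used in [29] may differ.

```python
import math


def soft_histogram(features, codebook, sigma=1.0):
    """Kernel-codebook encoding: each feature votes for all codewords,
    weighted by a Gaussian of its distance to each codeword."""
    hist = [0.0] * len(codebook)
    for f in features:
        d2 = [sum((x - c) ** 2 for x, c in zip(f, word)) for word in codebook]
        d2_min = min(d2)  # shifting by the minimum avoids underflow to all zeros
        w = [math.exp(-(d - d2_min) / (2 * sigma ** 2)) for d in d2]
        total = sum(w)
        for k, wk in enumerate(w):
            hist[k] += wk / total  # each feature contributes a total mass of 1
    return [h / len(features) for h in hist]


codebook = [[0.0], [2.0]]
midway = soft_histogram([[1.0]], codebook)      # equidistant from both codewords
near_first = soft_histogram([[0.0]], codebook)  # sits exactly on the first codeword
```

A feature halfway between two codewords splits its vote evenly, whereas hard assignment would have to pick one of them arbitrarily.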

Kernel learning Kernel-based learning methods are typically used to develop robust event detectors from audiovisual features. As described in [22], we relied predominantly on the support vector machine (SVM) framework for supervised learning of events: specifically, the LIBSVM implementation with probabilistic output. To handle imbalance in the number of positive versus negative training examples, we fixed the weights of the positive and negative classes by estimating the prior probabilities of the classes on training data. We used the histogram intersection kernel and its efficient approximation as suggested by Maji et al. [30]. For difference coded BOWs, we used a linear kernel [19].
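For reference, the exact (non-approximated) histogram intersection kernel is simply the sum of element-wise minima of two histograms. The sketch below computes it and the resulting Gram matrix for toy data; the function names are hypothetical, and the efficient approximation of [30] is not shown.

```python
def hik(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def gram_matrix(rows):
    """Pairwise kernel values, in the form used by precomputed-kernel SVM training."""
    return [[hik(a, b) for b in rows] for a in rows]


histograms = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]]
gram = gram_matrix(histograms)
```

For L1-normalized histograms the kernel value lies between 0 (no shared mass) and 1 (identical histograms).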

Experiments We evaluated the performance of these two event classifiers on a set of 12,862 videos drawn from the training and development data from the TRECVID MED evaluation. This SESAME Evaluation dataset consisted of a training set of 8,428 videos and a test set of 4,434 videos sampled from 20 event classes and other classes that did not belong to any of the 20 events. To make good use of the limited number of available positive instances of events, the positives were distributed so that, for each event, there were approximately twice as many positives in the training set as there were in the test set. Separate classifiers were trained for each event based on a one-versus-all paradigm. Table 1 shows the performance of the two event classifiers measured by mean average precision (MAP). Color-average coding with a histogram intersection kernel (HIK) SVM slightly outperformed color-difference soft coding with a linear SVM. For events such as changing a vehicle tire and town hall meeting, the average HIK was the best event representation. However, for some events, such as flash mob gathering and dog show, the difference coding was more effective. To study whether the representations complement each other, we also performed a simple average fusion; the results indicate a further increase in event detection performance, improving mean average precision from 0.342 to 0.358 and giving the best overall performance for the majority of events.

Table 1 Mean average precision (MAP) of event classifiers with low-level visual features and their fusion for 20 TRECVID MED evaluation event classes
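As a reminder of how the metric is computed, the sketch below implements non-interpolated average precision for one event and averages it over events. We assume this standard definition here; the exact scoring tool used in the evaluation may differ in details such as tie handling.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits if hits else 0.0


def mean_average_precision(per_event):
    """Average the per-event AP values; per_event is a list of (scores, labels)."""
    return sum(average_precision(s, l) for s, l in per_event) / len(per_event)


# One event, four test clips; positives are ranked first and third.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Here the precision is 1 at the first positive and 2/3 at the second, giving an average precision of 5/6.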

Motion features

Many motion features for activity recognition have been suggested in previous work; [4] provides an evaluation of motion features for classifying web videos on the NIST MED 2011 dataset. Based on our analysis of previous work and some small-scale experiments, we decided to use three features: spatiotemporal interest points (STIPs), dense trajectories (DTs) [31], and MoSIFT [32]. STIP features are computed at corner-like locations in the 3D spatiotemporal volume. Descriptors consist of histograms of gradient and optical flow at these points. This is a very commonly used descriptor; more details may be found in [33]. Dense trajectory features are computed on a dense set of local trajectories (typically computed over 15 frames). Each trajectory is described by its shape and by histograms of intensity gradient, optical flow, and motion boundaries around it. Motion boundary features are somewhat invariant to camera motion. MoSIFT, as its name suggests, uses SIFT feature descriptors; its feature detector is built on motion saliency. STIP and DT were extracted using the default parameters as provided; the MoSIFT features were obtained in the form of coded BOW features.

After the extraction of low-level motion features, we generated a fixed-length video-level descriptor for each video. We experimented with the coding schemes described in Sect. 2.2.1 for the STIP and DT features; for MoSIFT, we were able to use BOW features only. We used the training and test sets described above.

We trained separate SVM classifiers for each event and each feature type. Training was based on a one-versus-all paradigm. For conventional BOW features, we used the \(\chi ^2\) kernel. We used the Gaussian kernel for VLAD and FV. To select classifier-independent parameters (such as the codebook size), we conducted fivefold cross validation of 2,062 videos from 15 event classes. We conducted fivefold cross validation on the training set to select classifier-dependent parameters. For BOW features, we used 1,000 codewords; for FV and VLAD, we used 64 cluster centers. More details of the procedure are found in [34].
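Several forms of the chi-square kernel are in use. The sketch below shows one common exponential form, \(K(x,y)=\exp (-\gamma \sum _i (x_i-y_i)^2/(x_i+y_i))\), which is an assumption on our part since the exact variant and the value of \(\gamma \) are not specified here.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two non-negative histograms."""
    # Bins that are empty in both histograms contribute nothing.
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)


x = [0.5, 0.5, 0.0]
y = [0.0, 0.5, 0.5]
k_xx = chi2_kernel(x, x)
k_xy = chi2_kernel(x, y)
```

Relative to a plain Euclidean distance, the division by \(x_i+y_i\) reduces the influence of differences in heavily populated bins.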

We compared the performance of conventional BOW, FV, and VLAD for STIP features; BOW and FV for DT features; and BOW for MoSIFT, using the SESAME Evaluation dataset. Table 2 shows the results.

Table 2 Mean average precision of event classifiers with motion features for 20 TRECVID MED evaluation event classes

We can see that FV gave the best MAP for both STIP and DT. VLAD also improved MAP for STIP, but was not as effective as the FV features. We were not able to perform VLAD and FV experiments for MoSIFT features, but would expect to have seen similar improvements there.

Audio features

The audio is modeled as a bag-of-audio-words (BOAW). The BOAW has recently been used for audio document retrieval [35] and copy detection [36], as well as MED tasks [37]. Our recent work [38] describes the basic BOAW approach. We extracted the audio data from the video files and converted them to a 16 kHz sampling rate. We extracted Mel frequency cepstral coefficients (MFCCs) for every 10 ms interval using a Hamming window with 50 % overlap. The features consist of 13 values (12 coefficients and the log-energy), along with their delta and delta–delta values. We used a randomized sample of the videos from the TRECVID 2011 MED evaluation development set to generate the codebook. We performed \(k\)-means clustering on the MFCC features to generate 1,000 clusters. The centroid for each cluster is taken as a code word. The soft quantization process used the codebook to map the MFCCs to code words. We trained an SVM classifier with a histogram intersection kernel on the soft quantization histogram vectors of the video examples, and used the classifier to detect the events. Evaluation with the SESAME Evaluation dataset showed that the audio features achieved a MAP of 0.112.

Visual concepts

Two event classifiers were based on concept detectors. We followed the pipeline proposed in [39]. We decoded the videos by uniformly extracting one frame every 2 s. We then applied all available concept detectors to the extracted frames. After we concatenated the detector outputs, each frame was represented by a concept vector. Finally, we aggregated the frame representations into a video-level representation by averaging and normalization. On top of this concept representation per video, we used either a HIK SVM or a random forest as an event classifier.
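The aggregation step can be sketched as follows, assuming each sampled frame yields one vector of detector scores. The type of normalization is not specified above, so the L1 normalization used here is an assumption, and the function name is hypothetical.

```python
def video_concept_vector(frame_scores):
    """Average per-frame concept scores, then L1-normalize (assumed normalization)."""
    n = len(frame_scores)
    # zip(*rows) iterates over concepts (columns) across all frames.
    mean = [sum(col) / n for col in zip(*frame_scores)]
    total = sum(mean)
    return [m / total for m in mean] if total > 0 else mean


# Two frames, two concept detectors.
video_vec = video_concept_vector([[1.0, 3.0],
                                  [3.0, 1.0]])
```

The resulting vector has one entry per concept detector regardless of the clip length, so clips of different durations can be compared by the same event classifier.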

To create the concept representation, we needed a comprehensive pool of concept detectors. We built this pool of detectors using the human-annotated training data from two publicly available resources: the TRECVID 2012 Semantic Indexing task [40] and the ImageNet Large-Scale Visual Recognition Challenge 2011 [41]. The former has annotations for 346 semantic concepts on 400,000 key frames from web videos. The latter has annotations for 1,000 semantic concepts on 1,300,000 photos. The categories are quite diverse and include concepts from various types; i.e., objects like helicopter and harmonica, scenes like kitchen and hospital, and actions like greeting and swimming. Leveraging the annotated data available in these datasets, we trained 1,346 concept detectors in total.

We followed the state-of-the-art for our implementation of the concept detectors. We used densely sampled SIFT, OpponentSIFT, and C-SIFT descriptors, as we had for our event detector using visual features, but this time, we used difference coding with FV [19]. We used a visual vocabulary of 256 words. We again used the full image and three horizontal bars as a spatial pyramid. The feature vectors representing the training images formed the input for a linear SVM.

Experiments with the SESAME Evaluation dataset, summarized in Table 3, show that the random forest classifier is more successful than the non-linear HIK SVM for event detection using visual concepts, although the two approaches are quite close on average. Note that the event detection results using visual concepts are close to our low-level representation using visual or motion features.

Table 3 Mean average precision of event classifiers with visual concept features for 20 TRECVID MED evaluation event classes

Automatic speech recognition

Spoken language content is often present in user-generated videos and can potentially contribute useful information for detecting events. The recognized speech has direct semantic information that typically complements the information contributed by low-level visual features. We used DECIPHER, SRI’s ASR software, to recognize spoken English. We used acoustic and language models obtained from an ASR system [42] trained on speech data recorded in meetings with a far-field microphone. Initial tests on the audio in user-generated videos revealed that the segmentation process, which distinguishes speech from other audio, often misclassified music as speech. Therefore, before running the speech recognizer on these videos, we constructed a new segmenter, which is described below.

The existing segmenter was GMM based and had two classes (speech and non-speech). For this effort, we leveraged the availability of annotated TRECVID video data and built a segmenter better tuned to audio conditions in user-generated videos. We built a segmenter with four classes: speech, music, noise, and pause. We measured the effectiveness of the new segmentation by the word-error rates (WERs) obtained by feeding the speech-segmented audio to our ASR system. We found that the new segmentation helped reduce the WER from 105 to 83 %. This confirmed that the new segmentation models were a better match to the TRECVID data than models trained on meeting data. For reference, when all the speech segments were processed by the ASR, the WER obtained by our system was 78 % (this oracle segmentation provided the lowest WER that could be achieved by improving the segmentation).

To create features for the event classifiers, we used ASR recognition lattices to compute the expected word counts for each word and each video. This approach provided significantly better results compared to using the 1-best ASR output, because it compensated for ASR errors by including words with lower posteriors that were not necessarily present in the 1-best. We computed the logarithm of the counts for each word, appended them to form a feature vector of dimension 34,457, and used a linear SVM for the event classifiers. More details may be found in [43]. Evaluation with the SESAME Evaluation dataset showed that the ASR event classifiers achieved a MAP of 0.114.
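The feature construction can be sketched as below, assuming the lattice has already been reduced to a list of (word, posterior) pairs for one video. Using \(\log (1+c)\) so that unseen words map to zero is our assumption; the handling of zero counts is not described above.

```python
import math


def expected_count_features(word_posteriors, vocabulary):
    """Sum lattice posteriors per word and take log(1 + expected count)."""
    counts = dict.fromkeys(vocabulary, 0.0)
    for word, posterior in word_posteriors:
        if word in counts:  # out-of-vocabulary words are ignored
            counts[word] += posterior
    return [math.log1p(counts[w]) for w in vocabulary]


# Hypothetical lattice summary: "cake" appears on two competing paths.
vocabulary = ["cake", "candle", "tire"]
lattice = [("cake", 0.5), ("cake", 0.5), ("candle", 0.25)]
asr_features = expected_count_features(lattice, vocabulary)
```

A word such as "candle", which has a low posterior and might be absent from the 1-best output, still contributes a non-zero feature value.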


We implemented a number of fusion methods, all of which involved a weighted average of the detection scores from the individual event classifiers. The methods for determining the weights considered several factors:

  • Event dependence and learned weights Because the set of most reliable data types for different events might vary, we considered the importance of learning the fusion weights for each event using a training set. However, when there is limited data available for training, aggregating the data for all events and computing a fixed set of weights for all events may yield more reliable results. Another strategy is to set the weights without training with any data at all. For example, in the method of fusing with the arithmetic mean of the scores, all the weights are equal.

  • Score dependence For weights learned via cross-validation on a training set, a single set of fixed weights might be learned for the entire range of detection scores. Alternatively, the multidimensional space of detection scores might be partitioned into a set of regions, with a set of weights assigned to each region. In general, more data is needed for score-dependent weights to avoid overfitting.

  • Adjustment for missing scores When the scores for some types of data (particularly for ASR and MFCC) are missing, a default value, such as an average for the missing score, might be used, but this could provide a misleading indication of contribution. Other ways of dealing with missing scores include renormalizing the weights of the non-missing scores, or learning multiple sets of weights, each set for a particular combination of non-missing scores.

We evaluated the fusion models described below. All the models operated on detection scores that were normalized using a Gaussian function (i.e., computing the \(z\) score by removing the mean and scaling by the standard deviation).
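The normalization itself is a one-line transformation per data type. The sketch below estimates the mean and standard deviation from the same list of scores that it normalizes, which is a simplifying assumption; in practice, the statistics would typically be estimated on training scores and then applied to test scores.

```python
import math


def z_normalize(scores):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    # Constant scores carry no ranking information; map them all to zero.
    return [(s - mean) / std for s in scores] if std > 0 else [0.0] * n


z = z_normalize([1.0, 2.0, 3.0])
```

After normalization, scores from classifiers with very different output ranges are on a comparable scale, which the weighted-average fusion methods rely on.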

Arithmetic mean (AM) In this method, we compute the AM of the scores of the observed data types for a given clip. Missing data types for a given clip are ignored, and the averaging is performed over the scores of observed data types.

Geometric mean (GM) In this method, we compute the uniform GM of the scores of the observed data types for a given clip. As we do for AM, we ignore missing data types and compute the geometric mean of the scores from observed data types.

Mean average precision-weighted fusion (MAP) This fusion method weighs scores from the observed data types for a clip by their normalized average precision scores, as computed on the training fold. Again, the normalization is performed only over the observed data types for a given clip.
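The AM and MAP-weighted methods differ only in their weights, as the following minimal sketch shows. It assumes a missing score is represented by `None`, and the average precision values are hypothetical.

```python
def arithmetic_mean_fusion(scores):
    """Average the scores of the observed data types; None marks a missing score."""
    observed = [s for s in scores if s is not None]
    return sum(observed) / len(observed)


def ap_weighted_fusion(scores, average_precisions):
    """Weight each observed score by its data type's average precision,
    renormalizing the weights over the observed data types only."""
    pairs = [(s, w) for s, w in zip(scores, average_precisions) if s is not None]
    total = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total


# Three data types for one clip; the second (e.g., ASR) produced no score.
clip_scores = [1.0, None, 3.0]
train_ap = [0.75, 0.5, 0.25]  # hypothetical per-data-type AP on the training fold
am = arithmetic_mean_fusion(clip_scores)
ap_fused = ap_weighted_fusion(clip_scores, train_ap)
```

With these values, the more reliable first data type pulls the fused score from 2.0 (arithmetic mean) down to 1.5.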

Weighted mean root (WMR) This fusion method is a variant of the MAP-weighted method. In this method, we compute the fusion score as we do for MAP-weighted fusion, except the final fused score \({{\varvec{x}}}^{\prime }\) is determined by performing a power normalization of the MAP-based fused score \({{\varvec{x}}}\):

$$\begin{aligned} {{\varvec{x}}}^{\prime }={{\varvec{x}}}^{\frac{1}{\alpha }} \end{aligned}$$

where \(\alpha \) is the number of non-missing data types for that video. In other words, the higher the number of data types from which the fusion score is computed, the more trustworthy the output.
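The effect of the power normalization can be illustrated with a short sketch. It assumes that the MAP-based fused score lies in the interval [0, 1], so that taking a root raises the score; the function name and the input checks are hypothetical.

```python
def weighted_mean_root(fused_score, num_observed):
    """Power-normalize a fused score: x' = x ** (1 / alpha)."""
    if num_observed < 1:
        raise ValueError("at least one data type must be observed")
    if fused_score < 0:
        raise ValueError("a real-valued root requires a non-negative score")
    return fused_score ** (1.0 / num_observed)
```

Under this assumption, a fused score of 0.25 remains 0.25 when it comes from one data type, becomes 0.5 with two data types, and becomes about 0.71 with four, so clips scored by more data types are ranked higher.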

Conditional mixture model This model combines the detection scores from various data types using mixture weights that were trained by the expectation maximization (EM) algorithm on the labeled training folds. For clips that are missing scores from one or more data types, we provide the expected score for that data type based on the training data.

Sparse mixture model (SMM) This extension of the conditional mixture model addresses the problem of missing scores for a clip by computing a mixture for only the observed data types [44]. This is done by renormalizing the mixture weights over the observed data types for each clip. The training was done with the EM algorithm, but because the maximization step no longer had a closed-form solution, we used gradient-descent techniques to learn the optimal weights.

SVMLight This fusion model consists of training an SVM using the scores from various data types as the features for each clip. Missing data types for a given clip are assigned zero scores. We used the SVMLight (Footnote 4) implementation with linear kernels.

Distance from threshold This is a weighted averaging method [3] that dynamically adjusts the weights of each data type for each video clip based on how far the score is from its decision threshold. If the detection score is near the threshold, the correct decision is presumed to be somewhat uncertain, and a lower weight is assigned. A detection score that is much greater or much lower than the threshold indicates that more confidence should be placed in the decision, and a higher weight is assigned.
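The idea can be sketched as follows. The exact weighting function is defined in [3] and is not reproduced here; this sketch simply assumes that the weight is proportional to the absolute distance between the score and its threshold, normalized over the observed data types. All names are hypothetical.

```python
def distance_from_threshold_fusion(scores, thresholds):
    """Weighted average with weights proportional to |score - threshold|."""
    observed = {k: s for k, s in scores.items() if s is not None}
    if not observed:
        return None
    # Assumed weighting: absolute distance from the decision threshold.
    distances = {k: abs(s - thresholds[k]) for k, s in observed.items()}
    total = sum(distances.values())
    if total == 0:
        # Every score sits exactly on its threshold; use an unweighted mean.
        return sum(observed.values()) / len(observed)
    return sum(distances[k] / total * s for k, s in observed.items())
```

A confident score therefore dominates an uncertain one: with both thresholds at 0, a visual score of 2.0 and a motion score of 0.1 fuse to about 1.91, compared with an unweighted mean of 1.05.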

Bin accuracy weighting This method tries to address the problem of uneven distribution of detection scores in the training set. For each data type, the range of scores in the training fold is divided into bins with approximately equal counts per bin. During training, the accuracy of each bin is measured by computing the proportion of correctly classified videos whose scores fall within the bin. During testing, for each data type, the specific bin that the scores fall into is determined, and the corresponding bin accuracy scores for each data type are used as fusion weights.
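The training and testing steps above can be sketched as follows. The sketch assumes that a clip is classified as positive when its score meets a per-data-type decision threshold, and that the bin accuracies are normalized over the observed data types before they are used as weights. Neither detail is stated in the text, and all names are hypothetical.

```python
from bisect import bisect_left


def train_bin_accuracies(scores, labels, threshold, num_bins):
    """Split training scores into equal-count bins and measure bin accuracy.

    Returns (edges, accuracies), where edges holds the upper score bound of
    every bin except the last one.
    """
    if not scores:
        raise ValueError("training scores must not be empty")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    edges, accuracies = [], []
    for b in range(num_bins):
        members = order[b * n // num_bins:(b + 1) * n // num_bins]
        if not members:
            continue  # fewer training scores than bins
        correct = sum((scores[i] >= threshold) == bool(labels[i]) for i in members)
        accuracies.append(correct / len(members))
        edges.append(scores[members[-1]])
    return edges[:-1], accuracies


def bin_accuracy_fusion(scores, models):
    """Fuse observed scores, weighting each by the accuracy of its bin."""
    weighted_sum, weight_total = 0.0, 0.0
    for data_type, score in scores.items():
        if score is None:
            continue
        edges, accuracies = models[data_type]
        weight = accuracies[bisect_left(edges, score)]
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None
```

Here `models` maps each data type to the pair returned by `train_bin_accuracies` for that data type.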

Table 4 summarizes the fusion methods and their characteristics.

Table 4 Fusion methods and their characteristics

Experimental results

We evaluated the performance of our SESAME system using the data provided in the TRECVID MED evaluation task. Although the MED event kit contained both a text description and video examples for each event, the SESAME system implemented the example-based approach in which only the video examples were used for event detection training.

Evaluation by data type

Table 5 lists results on the SESAME Evaluation dataset. In terms of the performance of the various data types, the visual features were the strongest performers across all events. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. The motion features also showed strong performance. Although the performance of low-level audio features and ASR was significantly lower, ASR had the highest performance for events containing a relatively large amount of speech content, including a number of instructional videos. The best scores for each event are distributed among all of the data types, indicating that fusion of these data should yield improved performance. Indeed, the AM fusion of the individual event classifiers, which is listed in the last column of Table 5, shows a significant boost in performance: a 33 % improvement over the best single data type.

Table 5 Experiment results in terms of mean average precision for individual event classifiers

Evaluation of fusion methods

We tested the late fusion methods described in Sect. 2.3 using the SESAME Evaluation dataset. For all our fusion experiments, we trained each event classifier on the training set, and executed the classifier on the test set to produce detection scores for each event. To produce legitimate fusion scores over the test set, we used tenfold cross validation, with random fold selection, to generate the detections, and then obtained a micro-averaged average precision over the resulting detections. The micro-averaged MAP was computed by averaging the average precision for each event. To gauge the stability of the fusion methods, we repeated this process 30 times and computed the macro average and standard deviation of the micro-averaged MAPs. Because the AM and GM methods are untrained, their micro-averaged MAPs will be the same regardless of fold selection; thus, the standard deviations for their micro-averaged MAPs are zero.
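For reference, the metric can be sketched as follows. The sketch uses the common non-interpolated definition of average precision, which is our assumption; the official TRECVID scoring tools may differ in details such as tie handling. The function names and the data layout are hypothetical.

```python
from statistics import mean


def average_precision(scores, labels):
    """Non-interpolated average precision of a ranked detection list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_event):
    """Average the per-event average precision values.

    per_event maps an event name to a (scores, labels) pair.
    """
    return mean([average_precision(s, l) for s, l in per_event.values()])
```

Repeating the cross-validation with different random folds yields one MAP value per repetition, from which the mean and standard deviation over the 30 runs can be computed.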

Table 6 shows the MED performance of various fusion methods. The comparison indicates that the simplest fusion methods, such as AM and GM, performed as well as or better than other, more complex fusion methods. Also note that most of the top-performing fusion methods (AM, GM, MAP, WMR, and SMM) adjusted their weights to accommodate missing scores.

Table 6 MED performance of fusion methods with all event classifiers

Evaluation of MED performance in TRECVID

As the SESAME team, we participated in the 2012 TRECVID MED evaluation and submitted the detection results for a system configured nearly the same as that described in this paper (Footnote 5). The event classifiers were trained with all the positives from the event kit and negatives from the TRECVID MED training and development material. The test set consisted of the 99,000 videos used in the formal evaluation.

Figure 5 shows the performance of the primary runs of 17 MED systems in this evaluation in terms of miss and false alarm rates [45]. The performance of the SESAME run was one of the best among the evaluation participants.

Fig. 5 Performance of the primary runs of 17 MED systems in the 2012 TRECVID MED evaluation

Summary and discussion

SEarch with Speed and Accuracy for Multimedia Events (SESAME), a MED capability that learns event models from a set of example video clips, includes a number of BOW event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and ASR. Partitioning the representation by data type permits the descriptors for each data type to be optimized independently. We evaluated the detection performance for each event classifier and experimented with a number of fusion methods for combining the event detection scores from these classifiers. Our experiments using multiple data types and late fusion of their scores demonstrated reliable MED performance.

Major conclusions from this effort include:

  • The relative contribution of visual, motion, and audio features varies according to the specific event. This is due to differences in the relative distinctiveness and consistency of certain features for each event category. Across all events, score-level fusion resulted in a 33 % improvement over the best single data type, indicating that different types of features contribute to the representation of heterogeneous video data.

  • The use of difference coding in low-level visual and motion features significantly improved performance. We surmise that difference coding works better than the traditional BOW because it measures differences from the general model, which is likely to be dominated by the background features. We expect additional gains in performance if difference coding were applied to low-level audio features.

  • The set of 1,346 high-level visual features was nearly as effective as the set of low-level visual features. It appears that, in comparison to the 5,000 or so concepts predicted to be needed for sufficient performance in event detection [46], this number of high-level features begins to span the space of concepts reasonably well. Therefore, analogous sets of motion and audio concepts should further improve the overall performance.

  • Although the performance of ASR was lower than that of the visual and motion features, its performance was highly event dependent, and it performed reasonably well for events containing a relatively large amount of speech content, such as instructional videos.

  • The simplest fusion methods for computing event detection scores were very effective compared to more complex fusion methods. One possible explanation for this is that the reliability of the scores is roughly equal across all data types. Another possible reason is that the limited number of positive training examples (an average of about 70 per event) is not enough to achieve the full benefit of the more complex fusion models.

While our relatively straightforward BOW approach was quite effective, we view it as a baseline capability that could be improved in several ways:

  • Since the current approach aggregates low-level visual and motion features within fixed spatial partitions, the usage of local information is limited. Features of an object divided by our predefined partition, for example, will not be aggregated as a whole. We expect that the use of dynamic spatial pooling, which is better aligned to the structure and content of the video imagery, will improve performance. Segmenting the image into meaningful homogeneous regions would be even better, as it allows more salient characteristics to be extracted, and would eventually lead to better classification.

  • The current approach ignores the temporal information within each video clip; all the visual, motion, and audio features are aggregated. However, events consist of multiple components that appear at different times; therefore, using time-based information for event modeling and detection should improve performance. In addition, aggregating low-level features according to the temporal structure of the video may yield feature sets that better represent the video contents.

  • All the classifiers in our approach operate on a histogram of features and do not leverage any relationships between the features. Features occurring in video data are not generally independent. In particular, the combination of particular high-level semantic concepts could become strong discriminatory evidence, since their co-occurrence might be associated with a subset of relevant video content. For example, although the concepts balloons and singing occur in many contexts, the occurrence of both might be more common to birthday party than to other video content. Exploiting the spatiotemporal dependencies among the features would better characterize the video contents and offer a richer set of data with which to build event models.