
Multimedia Tools and Applications, Volume 76, Issue 1, pp 1331–1353

Efficient copy detection for compressed digital videos by spatial and temporal feature extraction

  • Po-Chyi Su
  • Chin-Song Wu

Abstract

This research aims at developing a practical video copy detection mechanism to determine whether an investigated video is a duplicated copy that may infringe intellectual property rights. The significant features of original videos are extracted and stored on the server. Given an uploaded video, the same features are extracted and compared with the stored ones to seek a possible match. Both the spatial and temporal features of compressed videos are employed in the proposed scheme. Scene-change detection is applied to select the key frames, from which robust spatial features are extracted to help search for visually similar frames. The shot lengths are used as the temporal features to further ensure the matching accuracy. To keep the proposed method practical in the targeted applications, the size of the stored features on the server and the efficiency and accuracy of feature matching are the major design principles. Experimental results on a large number of compressed videos demonstrate the feasibility of the proposed scheme.

Keywords

Copy detection · Video coding · Feature extraction · Content management · Multimedia databases · Copyright protection

1 Introduction

Digital videos are distributed widely these days on various kinds of media thanks to the proliferation of cheaper but increasingly powerful personal computers, the prevalence of high-speed networking facilities and the advanced video coding technologies. Many multimedia service providers develop convenient platforms for uploading and sharing digital videos. These platforms further facilitate the spread of digital content. Although users surely enjoy such convenience and have fun watching a variety of videos, the content owners may not always be supportive since many videos may be presented without their permission or infringe the intellectual property rights. The service providers may be requested to remove certain video clips or even be sued for the copyright violation. Therefore, the issues of copyright protection become critical for the video servers or service providers to reduce such controversies or disputes.

As there are a large number of video clips uploaded to the video servers every day, finding the video clips that may violate copyrights is a challenging task. Digital watermarking and content matching are two potential solutions. In digital watermarking, video content providers may embed an imperceptible signal, i.e., the digital watermark, into their distributed videos to claim the ownership. The video server will be provided with watermark detectors to investigate whether a video clip, once uploaded, belongs to the copyrighted material by checking the existence of the watermark signal. The general requirements of a digital watermark for copy detection include imperceptibility, low false detection rates and robustness against content-preserving processing. Although the methodology of digital watermarking seems elegant, certain problems exist. First of all, it is not possible to embed the watermark into already distributed video contents, so only some videos, probably newly issued, will be protected. Besides, the content providers may have different opinions about the requirements/techniques of digital watermarking, so varying watermarking approaches may be employed. The watermark detection will become less efficient since many watermark detection procedures (from different content providers) have to be applied to each uploaded video clip. The standardization of digital watermarking may be a solution but has proven to be less successful. To solve these problems, some popular video servers propose their own watermarks for content providers to embed in their distributed work. Nevertheless, the video content providers are reluctant to embed such signals as they are afraid that the content may be controlled by the video servers in the future. Furthermore, the extra cost of watermark embedding and the possible quality degradation of video content are further hindrances to the acceptance of watermarking techniques by content providers.

In the methodology of content matching, the uploaded video clip will be matched with the archived data to identify whether it is a copy. Compared with digital watermarking, the content-based approach has an edge in that no additional procedures such as watermark embedding have to be applied before the distribution of the original content. The quality of the distributed content will not be affected at all. Besides, many more videos will be protected since the matching processes can theoretically be applied as long as the information about the copyrighted material is available to the video servers. However, it is impractical for the video server to possess all of the original videos due to the reluctance of sharing from the content owners and the requirement of an extremely large storage. The massive volume of videos in the database would also make the tracing of video content difficult, and the large number of videos uploaded every day further aggravates this problem. Furthermore, the video providers cherish their video contents and will not be willing to share these digital assets with the video web servers. To achieve practical and efficient video content matching, one solution is to extract and store the significant features from videos for comparison, instead of the videos themselves. The data volume of features will be significantly smaller than that of videos, so the storage problem can be alleviated. The content providers can process their videos with the same feature extraction methods and then offer the extracted information to the video servers for the copy detection.

Existing methods rely on extracting spatial features from images and temporal features from videos as the reliable content “hash”. Since visually similar contents should have similar color statistics, which are usually preserved after transcoding or other processes, the color histogram is an important feature for content-based retrieval [7, 11, 23]. Hsu et al. [9] proposed to divide the video frames into sub-images according to the image content, and the local color histogram of each partition is used for content matching. The major problem of using the color histogram only is the higher rate of “collision” as different contents may have similar colors. Explicitly combining image hashing techniques with the video structure may be a preferred approach. For example, the video shot boundaries are detected so that key frames can be chosen for the subsequent processes. A long video can thus be shrunk into a sub-video with a smaller size without losing the important information. Many algorithms have been proposed for video shot extraction [4] and can be roughly divided into two categories, i.e., the pixel-domain and compressed-domain approaches. In the pixel-domain approaches, the histogram and edge differences can be used to find adjacent shots with apparent color differences. In the compressed-domain approaches, such features as DC values, the number of intra/inter coded macroblocks and motion vectors are exploited. The advantage is that the expansion of the encoded video to full frames may be avoided. For the key frame extraction, Zhuang et al. [31] proposed to employ the idea of clustering. The key frames are chosen after combining similar shots and dividing them into clusters according to the frame sequence in the shots. Wolf [27] exploited the optical flow in shots to retrieve the key frames. Wang et al. [26] selected the key frames in the compressed domain and used the large motion intensity with location to increase the accuracy. They made use of the following two rules. First, the key frame should have a large motion intensity and, second, the spatial distribution of motion should focus on the center of a shot rather than dispersing around the two ends. The rough set theory [24] is further used to improve the performance. Liu et al. [17] proposed a so-called “Perceived Motion Energy” model to find the frames at the energy peaks as the key frames. Wu et al. [28] proposed to use the shot lengths as the temporal features. They employed a fast matching algorithm, i.e., suffix string matching, so that the detection can be done in an efficient way. Only the temporal information is used and no additional procedures, such as extracting DC values or color histograms, are adopted. De Roover et al. [5] selected the key frame set and then calculated the radial projection vectors to form 180 vectors for the calculation of the Discrete Cosine Transform (DCT). Coskun et al. [3] used a spatial-temporal transform to explicitly employ the temporal data. They applied spatial and temporal domain normalization and then computed the 3D-DCT to include the temporal information for constructing the hash. Zargari et al. [29] proposed a compressed-domain feature extraction approach for image indexing or retrieval in H.264/AVC. The histogram of spatial predictions in the intra coding is employed as the descriptor. The rotation and scale are also taken into account to increase the robustness.
Many methods rely on the extraction of interest points for generating the video summary or for near-duplicate video retrieval [1, 16]. Ling et al. [14] proposed a multi-scale descriptor for image copy detection. A series of SIFT [18] descriptors are extracted and Principal Component Analysis is applied to form the binary codes as the fingerprints for indexing. Kang et al. [10] also proposed a secure SIFT-based approach based on the sparse representation for image copy detection and recognition. Zhou et al. [30] employed the interest point selection for the video shot segmentation. Lu et al. [19] proposed a bag-of-importance model to extract important local descriptors or features. Song et al. [21] presented a multiple feature hashing approach to tackle both the accuracy and the scalability issues in near-duplicate video retrieval. Considering that the computational load of matching SIFT descriptors is high, Liu et al. [15] employed Singular Value Decomposition (SVD) to reduce the dimension and proposed a graph-based method for video segmentation.

Although there have been quite a few existing video copy detection methods, most of them resort to extracting the features from raw videos to ensure the matching precision. Such a methodology may significantly increase the storage and computational load of servers. One may think that the process of detecting a video copy is performed on servers, not clients, so complexity-related issues may be less critical. However, considering that many videos are uploaded at every moment, efficient video copy detection is still a necessary requirement to make the solution really practical. Besides, since all the uploaded videos are compressed and their resolutions are becoming larger these days, the servers may already be busy processing these coded videos. A solution without the need of fully decoding the compressed data is thus a more feasible choice for the video servers. Therefore, the objective of this research is to develop effective and efficient feature extraction for content-based video copy detection. In the proposed scheme, both the spatial and temporal features will be employed and they are designed to be robust to facilitate the content matching. In addition to the higher efficiency achieved by the design taking account of the compressed-domain data, the size of extracted features is also controlled to accommodate applications with a large video database, in which the videos are compressed with H.264/AVC. The remainder of the paper is organized as follows. The proposed scheme will be detailed in Section 2, followed by the experimental results in Section 3 to demonstrate the feasibility. The conclusion will be presented in Section 4.

2 The proposed scheme

Figure 1 shows the framework of the proposed scheme. The content providers send the extracted features from their original videos, which are usually encoded, to the server or service provider. When a user uploads an encoded video to share with others, the video features are extracted and compared with those archived in the feature database in the server to seek a possible match. If a copy is found, the video may be removed automatically. The spatial and temporal feature extraction process consists of a few steps. The scene-change detection will first be applied to divide the video into shots. Since the H.264/AVC is considered to be used for encoding videos, the data in the compressed bit-stream are examined to find the scene-change frames efficiently. The key frames will be selected from the detected scene-change frames for calculating the robust spatial features, which should resist some signal processing or transcoding procedures. The spatial features also serve as the indices for matching the contents. The lengths of shots around the key frames will be recorded as the temporal features, which help to ensure the accurate content matching. The designs of the proposed scheme are detailed as follows.
Fig. 1

The framework of the proposed scheme

2.1 The scene-change detection

Given the assumption that the videos owned by the content providers and users are compressed by H.264/AVC, the coding modes in frames will be examined first for the scene-change detection to achieve efficiency. Instead of pursuing absolutely accurate scene-change detection, the objective here is to select the same frames from the original video and the possibly transcoded video. Clean and sharp scene changes are the searched targets. The proposed scheme basically evaluates the differences of coding modes in frames, which can be acquired directly from the compressed bit-stream, to choose the candidates for the subsequent processing. The scenes containing faster motions will be ignored instead as they may cause ambiguity. We examine the two I-frames, I i and I j , from the adjacent Groups of Pictures (GOP), GOP i and GOP j , respectively, and compare the intra coding modes of macroblocks, which are classified into one of three types: 1) including the 16 × 16 prediction only, 2) containing 16 × 8, 8 × 16, or 8 × 8 predictions and 3) containing 4 × 4 predictions. The number of matched macroblock types is viewed as a rough similarity measurement of the two frames and a fairly strict condition is set to avoid wrongly recognizing two visually different frames as being in one shot. Since a scene may contain camera motions, which may generate different intra coding modes in frames of a shot, the two I frames are extracted to compare their luminance histograms based on the Bhattacharyya distance,
$$ D_{s}(I_{i},I_{j})=\sqrt{1-\frac{{\sum}_{k=0}^{255}{\sqrt{H_{i}(k) \times H_{j}(k)}}}{\sqrt{{\sum}_{k=0}^{255}{H_{i}(k)} \times {\sum}_{k=0}^{255}{H_{j}(k)}}}} $$
(1)
where H i (k) (H j (k)) is the kth bin of the luminance histogram of the frame I i (I j ). D s (I i , I j ) larger than a threshold T I , which is empirically set as 0.25, indicates that these two frames are quite different and a scene change may occur between I i and I j . To find the scene-change frame, we further calculate the percentage of macroblocks that are intra-coded, denoted by \(Pr_{P}^{(I)}\), in every P frame in GOP i . The P frame with the largest \(Pr_{P}^{(I)}\), denoted by \(\tilde {P}\), is chosen and compared with another threshold T P , which is empirically set as 0.75. \(\tilde {P}\) will be selected as the scene-change frame if \(Pr_{\tilde {P}}^{(I)}\geq T_{P}\). Otherwise, I j will be chosen. If I i and I j are far apart, especially in the case of employing a flexible GOP structure instead of a fixed GOP size, all the P frames with \(Pr_{P}^{(I)}\geq T_{P}\) will be considered scene-change frames. However, as mentioned before, a shot containing large motions may be wrongly identified as a scene change. To exclude such cases, we ignore the chosen frame if there are more than two P frames adjacent to the candidate with their \(Pr_{P}^{(I)}\) larger than T LM = 0.5. Figure 2 shows the flowchart of the scene-change detection process for two given I frames, I i and I j .
Fig. 2

The flowchart of the scene-change detection for two given I frames, I i and I j
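To make the above decision procedure concrete, the following Python sketch combines the histogram distance of (1) with the intra-macroblock ratios of the P frames in GOP i . It is a minimal illustration under our own assumptions: the function names, the return convention and the reading of the "more than two adjacent P frames" rule are not specified by the paper.

```python
import numpy as np

# Thresholds from Section 2.1 (empirical values reported in the paper)
T_I = 0.25    # Bhattacharyya distance threshold between I-frame histograms
T_P = 0.75    # intra-coded macroblock ratio threshold for P frames
T_LM = 0.5    # large-motion rejection threshold

def bhattacharyya_distance(hist_i, hist_j):
    """Eq. (1): distance between two 256-bin luminance histograms."""
    hist_i = np.asarray(hist_i, dtype=np.float64)
    hist_j = np.asarray(hist_j, dtype=np.float64)
    bc = np.sum(np.sqrt(hist_i * hist_j)) / np.sqrt(hist_i.sum() * hist_j.sum())
    return np.sqrt(max(1.0 - bc, 0.0))

def locate_scene_change(hist_i, hist_j, intra_ratios):
    """Given the luminance histograms of two adjacent I frames (I_i, I_j) and
    the list of intra-coded macroblock ratios Pr_P^(I) of the P frames in
    GOP_i, return the index of the detected scene-change P frame, the string
    'I_j' when I_j itself is chosen, or None when no scene change is declared.
    The return convention is ours, not the paper's."""
    if bhattacharyya_distance(hist_i, hist_j) <= T_I:
        return None                       # the two I frames look alike
    ratios = list(intra_ratios)
    if not ratios:
        return 'I_j'
    p = max(range(len(ratios)), key=lambda k: ratios[k])
    if ratios[p] < T_P:
        return 'I_j'                      # no P frame dominated by intra MBs
    # Reject shots with large motion: too many neighbouring P frames also
    # show a high intra ratio (our reading of the "more than two" rule).
    neighbours = ratios[max(0, p - 2):p] + ratios[p + 1:p + 3]
    if sum(r > T_LM for r in neighbours) > 2:
        return None
    return p
```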

A possible drawback of this compressed-domain methodology is that scene changes will only be detected in I and P frames. When B frames are used in the codec and a scene-change frame is encoded as a B frame, a miss could happen. Nevertheless, encoding several consecutive frames as B frames is not recommended because of possible quality degradation and buffering issues. “…B B P B B P…” is the most commonly used structure, so the difference between the scene-change frame detected by the proposed scheme and the exact one is at most two frames. Therefore, although a complete miss is unlikely, an inaccuracy of one or two frames is possible and has to be tackled as explained later.

It should be noted that only the active areas in video frames should be analyzed. In many videos, the letter box or static boundaries may be superimposed and such regions may not only complicate the setting of appropriate thresholds but also generate unstable spatial features, which will be described later. Therefore, the active area with the content should be identified beforehand. A straightforward method is to extract a few I-frames that are distant from each other and contain different contents. The differences of frames are calculated and an average frame is formed. The vertical and horizontal projections on this frame will reveal the existence of static boundaries and the active area that will be further analyzed. The usage of such thresholds as T I and T P should be related to the active area, instead of the entire frame, so that more accurate detections can be achieved.
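As a rough illustration of the active-area detection described above, the sketch below averages the differences of a few sampled I frames and thresholds the horizontal and vertical projections; the threshold value and the return convention are our assumptions.

```python
import numpy as np

def detect_active_area(i_frames, thresh=8.0):
    """Estimate the active (non-letterbox) area from a few distant I frames.
    Absolute differences of the sampled frames are averaged, and rows/columns
    whose mean difference stays below `thresh` (an assumed value) are treated
    as static borders. Returns (top, bottom, left, right) bounds."""
    frames = [np.asarray(f, dtype=np.float64) for f in i_frames]
    if len(frames) < 2:
        raise ValueError("need at least two frames with different contents")
    diffs = [np.abs(frames[k] - frames[k - 1]) for k in range(1, len(frames))]
    avg = np.mean(diffs, axis=0)
    row_proj = avg.mean(axis=1)   # horizontal projection of the average frame
    col_proj = avg.mean(axis=0)   # vertical projection
    rows = np.where(row_proj > thresh)[0]
    cols = np.where(col_proj > thresh)[0]
    if rows.size == 0 or cols.size == 0:
        return 0, frames[0].shape[0], 0, frames[0].shape[1]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1
```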

2.2 The key-frame selection

Among the extracted scene-change frames, we will select the key frames that can better represent the video. In our observation, the same scene may appear in consecutive shots. For example, during the conversation of two persons in a video, the shots of each person’s face may be shown periodically. We will eliminate these repeated scene-change frames so that the computation on content matching can be reduced. Therefore, the histogram distances in (1) between a scene-change frame, S i , and its neighboring scene-change frames, S i−2, S i−1, S i+1 and S i+2, are calculated. If the distances are all larger than T I , S i will be selected as a key frame. It should also be noted that, in the process of spatial feature generation described below, S i may be further removed if it contains less texture/information.
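A minimal sketch of this selection rule is given below; it assumes the histogram distance of (1) is available as distance_fn and simply keeps the scene-change frames that differ from all four neighbors by more than T I .

```python
def select_key_frames(scene_change_frames, distance_fn, T_I=0.25):
    """Keep a scene-change frame S_i as a key-frame candidate only if it
    differs from its neighbours S_{i-2}, S_{i-1}, S_{i+1}, S_{i+2} by more
    than T_I. `distance_fn` is assumed to be the histogram distance of (1);
    the candidate may still be dropped later by the spatial-feature stage."""
    key_frames = []
    for i, frame in enumerate(scene_change_frames):
        neighbours = [scene_change_frames[j]
                      for j in (i - 2, i - 1, i + 1, i + 2)
                      if 0 <= j < len(scene_change_frames)]
        if all(distance_fn(frame, n) > T_I for n in neighbours):
            key_frames.append(i)
    return key_frames
```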

2.3 The spatial features

The spatial feature for the content indexing and matching in the proposed scheme is generated in each key frame based on the modification of our previously proposed approach [22], which employs SVD for extracting representative data for these frames. The basic idea is to scale the selected frame into a smaller block with a fixed size. SVD is then applied on the block to acquire the eigenvector pairs corresponding to the largest eigenvalues, which are considered the significant features to represent the content of the block or frame well. The main reason that we choose not to adopt the extraction of interest points in frames, which may be employed in many recent approaches, is the concern of high computational complexity. Even though real-time execution may not be necessary, the requirement of higher efficiency is still very important since this research is targeted at the applications with large video databases. Besides, as mentioned before, the spatial feature extraction in the proposed scheme is to locate the appropriate positions for the calculation of temporal features, which are quite effective in differentiating the contents in video segments as we will discuss later. Reasonably good performances of spatial features should be enough to achieve the related objectives. Therefore, a global spatial feature in a frame will be computed, instead of searching local interest points.

In the proposed scheme, the selected frame will first be blurred by Gaussian filtering to eliminate noise and facilitate the down-sampling, which scales the active area of the frame to an M × M block. In fact, the block is not necessarily a square; a reasonably large rectangle can also work. SVD is then applied on the mean-removed block, X, as
$$ \mathbf{X}=\sum\limits_{m=1}^{M}\lambda_{m}^{\frac{1}{2}}\mathbf{u}_{m}\mathbf{v}_{m}^{T}, $$
(2)
where u m and v m are the eigenvectors of X X T and X T X, respectively, and the eigenvalues satisfy λ 1 ≥ λ 2 ≥ ⋯ ≥ λ M . The first and second eigenvector pairs, u 1, v 1, u 2 and v 2, each of which is an M × 1 vector, are chosen as the extracted features and will be stored in the feature database. If necessary, the size of the features can be further reduced by down-sampling the M × 1 vectors to N × 1 vectors, denoted by U and V, which are normalized to have unit norms. It should be noted that scaling the selected frame directly to a small block for the subsequent SVD may have a poor performance because the block may be too blurred. The matching of the spatial features is based on the correlation coefficients between {U 1, V 1} pairs or {U 2, V 2} pairs. If a frame A in an investigated video is to be compared with a frame B stored in the database, the similarity measurement R 1 between \(\{\mathbf {U}^{A}_{1}, \mathbf {V}^{A}_{1}\}\) and \(\{\mathbf {U}^{B}_{1},\mathbf {V}^{B}_{1}\}\) is calculated by
$$ R_{1}=\frac{|{\sum}_{n=1}^{8}\{\mathbf{U}^{A}_{1}(n)\times \mathbf{U}^{B}_{1}(n)+\mathbf{V}^{A}_{1}(n)\times \mathbf{V}^{B}_{1}(n)\}|}{2}. $$
(3)
If R 1 is larger than the threshold T R (set as 0.5), a match of the frames A and B is claimed. The absolute value is employed because the {U, V} pair and its negation, {−U, −V}, represent exactly the same feature. According to our experience, when the frame A is a transcoded version, it is possible that the matching of {U 1, V 1} fails but the matching of {U 2, V 2} works. Therefore, when the search based on R 1 is not successful, R 2 will be further calculated in a similar way by
$$ R_{2}=\frac{|{\sum}_{n=1}^{8}\{\mathbf{U}^{A}_{2}(n)\times \mathbf{U}^{B}_{2}(n)+\mathbf{V}^{A}_{2}(n)\times \mathbf{V}^{B}_{2}(n)\}|}{2}. $$
(4)
The frames A and B are also viewed as matched frames if R 2 > T R . It should be noted that the eigenvalues λ 1 and λ 2 can be viewed as importance measurements of the eigenvector pairs. In the proposed scheme, λ 1 or λ 2 has to be larger than T l , empirically set as 128, for the associated eigenvector pair to be a valid spatial feature.
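The following sketch illustrates the spatial feature extraction and matching described by (2)-(4), assuming M = 16 and N = 8. The exact down-sampling pattern and the interpretation of the eigenvalue test against T l are our assumptions.

```python
import numpy as np

M, N = 16, 8          # block size and down-sampled feature length (Section 2.3)
T_R, T_l = 0.5, 128   # matching and eigenvalue thresholds from the paper

def spatial_feature(block):
    """Extract the {U1, V1} and {U2, V2} pairs from an M x M block via SVD.
    Down-sampling and normalization follow the description in the text; the
    exact down-sampling pattern (taking every other component) is assumed."""
    X = np.asarray(block, dtype=np.float64)
    X = X - X.mean()                      # mean-removed block
    U, s, Vt = np.linalg.svd(X)           # s[k]**2 are the eigenvalues of X X^T
    feats = []
    for k in (0, 1):
        if s[k] ** 2 < T_l:               # eigenvalue too small: feature invalid
            feats.append(None)
            continue
        u = U[:, k][::M // N]             # down-sample M x 1 -> N x 1
        v = Vt[k, :][::M // N]
        feats.append((u / np.linalg.norm(u), v / np.linalg.norm(v)))
    return feats                          # [(U1, V1) or None, (U2, V2) or None]

def similarity(pair_a, pair_b):
    """Eqs. (3)/(4): magnitude of the averaged inner products of {U, V}."""
    (ua, va), (ub, vb) = pair_a, pair_b
    return abs(np.dot(ua, ub) + np.dot(va, vb)) / 2.0

def frames_match(feat_a, feat_b):
    """Try {U1, V1} first and fall back to {U2, V2}, as described in the text."""
    for k in (0, 1):
        if feat_a[k] is not None and feat_b[k] is not None \
                and similarity(feat_a[k], feat_b[k]) > T_R:
            return True
    return False
```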
The number of features stored in the database is still very large, so it is impractical to compare all of them to seek a match. A better strategy is to form subsets of features in the database. The features should be appropriately indexed so that a match can be found more conveniently by searching the related subset only. Considering that the features are vectors, vector quantization (VQ) is a very suitable approach. The algorithms of tree-structured VQ [8] or the maximum distance method [12] may help to construct such an indexing structure. Nevertheless, training a good VQ codebook for various frame contents may not be a trivial issue here because of the large video database in the server and the requirement of many bits in each index. The performance of the codebook will depend on the training data and may thus become less predictable. Therefore, the proposed scheme resorts to the idea of product VQ by simply using the signs of the components in {U, V} as the index. For one key frame, its spatial feature will be stored in at least two positions in the database, one indexed by {U 1, V 1} and the other indexed by {U 2, V 2}. The number of bits in an index is equal to 2N since each of {U, V} has N values. It should be noted that some components of the feature vector may be close to zero and their signs may not be stable after certain content-preserving processing. The index bits will thus be viewed as “don’t-care” bits if the corresponding component values are small, e.g., within \([-\frac {1}{N},+\frac {1}{N}]\). When there are d “don’t care” bits, the feature will be stored in 2^d positions as each of these d bits can be either “1” or “0”. It is quite obvious that too many “don’t care” bits will lead to storing the same feature in many more positions in the database. In the proposed scheme, if there are more than four “don’t care” bits, which indicates that more than 16 positions are required for storing this feature, the associated key-frame candidate is deemed less appropriate and removed. Figure 3 summarizes the procedure of selecting the key frames and indexing the spatial features.
Fig. 3

The flowchart of indexing the spatial features

From the above discussion, we know that the block size, M, and the size of U or V will affect the performance. Basically, M should be as small as possible but the M × M block still has to retain the overall appearance of the frame. M is thus set as 16 in the proposed scheme. The number of different indices for {U, V} would then be up to 2^32, which seems too large, and many indices may not be fully utilized. Therefore, the down-sampling is applied to make N = 8 and the number of indices is reduced to 2^16 for {U, V}. As mentioned before, directly scaling the frame into an 8 × 8 block would lose a lot of details of the content, so the down-sampling is applied to the eigenvectors instead. After recording each value of {U, V} with four bits, the proposed scheme uses 128 bits, including the bits of {U 1, V 1} and {U 2, V 2}, to represent a frame.
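A possible implementation of the sign-based indexing with "don't care" bits is sketched below; the enumeration of the 2^d index positions and the rejection of candidates with more than four "don't care" bits follow the description above, while the exact data layout of the database is left open as an assumption.

```python
import itertools
import numpy as np

def feature_indices(u, v, N=8, max_dont_care=4):
    """Build the 2N-bit sign index (or indices) for one {U, V} pair.
    Components with magnitude below 1/N become "don't care" bits and the
    feature is stored under every index they can take; more than
    `max_dont_care` such bits disqualifies the key-frame candidate."""
    comps = np.concatenate([np.asarray(u), np.asarray(v)])   # 2N components
    bits, dont_care = [], []
    for i, c in enumerate(comps):
        if abs(c) <= 1.0 / N:
            dont_care.append(i)
            bits.append(0)                # placeholder, enumerated below
        else:
            bits.append(1 if c > 0 else 0)
    if len(dont_care) > max_dont_care:
        return None                       # candidate removed, as in the paper
    indices = []
    for combo in itertools.product((0, 1), repeat=len(dont_care)):
        b = list(bits)
        for pos, val in zip(dont_care, combo):
            b[pos] = val
        indices.append(int("".join(map(str, b)), 2))
    return indices                        # up to 2**d database positions
```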

2.4 The temporal features

The shot lengths are employed as the temporal features since they can effectively preserve the unique characteristics of a video. The temporal feature is computed by
$$ L_{i}=t(S_{i})-t(S_{i-1}), $$
(5)
where t(S) is the corresponding “time unit” of the shot-change frame S. For an input video, the selected key frames and every shot length will be recorded. Figure 4 illustrates the structure, in which the scene-change frames S 1, S 3, S q and S q+3 are the 1st, 2nd, pth and (p + 1)th key frames, respectively. The spatial features of the key frames are computed and recorded along with their positions in the video. A straightforward implementation is to maintain a file that keeps all the shot lengths in a video. Each file offset corresponding to a key frame is recorded along with its spatial feature. If a spatial feature in the test video from a user and that of one video in the database are matched, the temporal features of these two videos can be acquired efficiently by locating the file offsets and retrieving their shot lengths for the subsequent processing.
Fig. 4

The structure of signature/temporal features

The basic operation of temporal feature matching is as follows. As shown in Fig. 4, for a key frame, K p , corresponding to a scene-change frame, S q , the matched frame is located based on the spatial features and then the adjacent shot lengths, {…L q−1, L q , L q+1…}, are retrieved. The temporal feature stream is transformed into a function L(t) by
$$ L(t)=\left\{ \begin{array}{lll} L_{q}, & L_{q}>t\geq0\\ L_{q-1}, & 0>t>-L_{q-1}\\ L_{g}, & \text{otherwise}, \end{array} \right. $$
(6)
where g meets either of the following conditions,
$$ \left\{\begin{array}{llll} {\sum}_{j=q }^{g}L_{j}>t\geq{\sum}_{j=q }^{g-1}L_{j}, & t\geq L_{q}\\ {\sum}_{j=g}^{q-1}L_{j}>-t\geq{\sum}_{j=g-1}^{q-1}L_{j}, & -L_{q-1}\geq t. \end{array} \right. $$
(7)
The construction of L(t) can be understood more easily by the illustration of Fig. 5. Each scene will form a square in L(t). Given L A (t) and L B (t) are the temporal feature functions of an investigated video and one original video to be compared respectively, they are said to be matched if
$$ \rho=\frac{1}{t_{e}-t_{s}}{\int}_{\!\!\!\!t_{s}}^{t_{e}}D_{L^{A},L^{B}}(t)dt\geq T_{M}, $$
(8)
where \(D_{L^{A},L^{B}}(t)\) is the difference function and T M is the threshold for indicating a match. t s ≤ 0 and t e ≥ 0 are the starting and ending times, which can be set according to the following two options. The first option is to let both |t s | and |t e | be close to a fixed period of time, say one minute, if the numbers of scenes in the durations t ≥ 0 and t < 0 are large enough. The second option is to determine |t s | and |t e | according to the number of scenes when the investigated video contains many scenes. We employ the first option to determine ρ unless very few scenes are to be compared. The difference function, \(D_{L^{A},L^{B}}(t)\), is defined as
$$ D_{L^{A},L^{B}}(t)=\left\{\begin{array}{ll} 1, & |L^{A}(t)-L^{B}(t)|\leq T_{d} \times L^{A}(t)\\ -1, & |L^{A}(t)-L^{B}(t)|>T_{d} \times L^{A}(t), \end{array} \right. $$
(9)
where T d is used to accommodate certain inaccurate scene-change detections and \(\frac {1}{8}\) is a reasonable choice. It should be noted that the introduced T d can also help to solve the lack of accuracy of detecting scene change frames, which are encoded as B frames, as mentioned before. The matched portion will generate positive values while other parts yield negative values. We set T M as 0 since the wrong pair usually results in a quite negative ρ. Since common investigated videos have a few key frames, a small number of key frames that generate matches will be sufficient to indicate that the two videos are related. In fact, we calculate two values ρ + and ρ , corresponding to the integration of (8) from 0 to t e divided by t e and from t s to 0 divided by −t s respectively. In regular cases, ρ in (8) is used to determine the matching result. For the key frames that are located at the beginning or the end of the video, only one of ρ + and ρ will be taken into account. Besides, certain commercials may be inserted into the investigated video and considering ρ + and ρ separately may help to identify such cases.
Fig. 5

The construction of L(t)
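The construction of L(t) in (6)-(7) and the matching score of (8)-(9) can be sketched as follows; since L(t) is piecewise constant, the integral is approximated here on a fixed time grid, which is an implementation choice rather than something specified by the paper.

```python
def build_L(lengths, q):
    """Return L(t) as defined by (6)/(7): `lengths` are the shot lengths of
    one video and `q` is the index of the shot that starts at the matched
    key frame, i.e. t = 0 is the key-frame position."""
    def L(t):
        if t >= 0:
            acc = 0.0
            for g in range(q, len(lengths)):
                acc += lengths[g]
                if t < acc:
                    return lengths[g]
            return lengths[-1]            # beyond the last shot: clamp
        acc = 0.0
        for g in range(q - 1, -1, -1):
            acc += lengths[g]
            if -t <= acc:
                return lengths[g]
        return lengths[0]                 # before the first shot: clamp
    return L

def rho(L_A, L_B, t_s, t_e, T_d=1.0 / 8, step=0.1):
    """Approximate (8) with the difference function of (9) on a fixed grid.
    The grid step (in the same time unit as the shot lengths) is an assumed
    implementation detail."""
    total, count = 0.0, 0
    t = t_s
    while t < t_e:
        a, b = L_A(t), L_B(t)
        total += 1.0 if abs(a - b) <= T_d * a else -1.0
        count += 1
        t += step
    return total / max(count, 1)
```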

The accuracy of matching temporal features is affected by the performance of scene-change detection. It is possible that a scene change is wrongly detected or skipped in the investigated video, which would be a transcoded or processed version. A lower ρ may be attained since the different scene lengths will contribute negative values in computing (8). Given that some scene changes can still be detected correctly, an adjustment procedure is applied during the matching process when ρ + or ρ is not large enough to declare a detection. Figure 6 illustrates the operation. After the matching of spatial features, the temporal features of the investigated video and the video to be compared start from \({L^{A}_{1}}\) and \({L^{B}_{1}}\) respectively. To recalculate ρ +, the scene lengths are accumulated in both videos and their difference is calculated by
$$ L^{A,B}_{x,y}=|\sum\limits_{a=1}^{x} {L^{A}_{a}} - \sum\limits_{b=1}^{y} {L^{B}_{b}}|. $$
(10)
A temporary match will be claimed if \(L^{A,B}_{x,y}\) is less than T j , which is set as \(\frac {1}{10}\) second. In the example shown in Fig. 6, \({L^{A}_{1}}+{L^{A}_{2}}\simeq {L^{B}_{1}}+{L^{B}_{2}}+{L^{B}_{3}}\) so a first match is found. The second match is detected when \({\sum }_{a=1}^{4} {L^{A}_{a}} \simeq {\sum }_{b=1}^{4} {L^{B}_{b}}\). The total matched length will be \(\tilde {L}={\sum }_{a=1}^{\tilde {x}} {L^{A}_{a}}\), where \(\tilde {x}\) is the largest x such that \(L^{A,B}_{x,y}<T_{j}\). If the total length of the investigated video to compute ρ + is \(\bar {L}\), then \(\rho ^{+}=\frac {2\tilde {L}-\bar {L}}{\bar {L}}\). To prevent possible false detections, the resulting ρ + should be larger so a higher threshold set as 0.75 is used. Furthermore, there have to be many (at least three) different {x, y} combinations such that \(L^{A,B}_{x,y}<T_{j}\) (or large k of \(\tilde {L}_{k}\) in Fig. 6) to claim a detection.
Fig. 6

The adjustment of temporal features
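A simplified sketch of this adjustment is given below; it accumulates the shot lengths of both videos starting from the matched key frame, records the temporary matches within T j and recomputes ρ +. The search range and the bookkeeping of the matches are our assumptions.

```python
def adjusted_rho_plus(lengths_A, lengths_B, T_j=0.1):
    """Sketch of the adjustment of Fig. 6 / Eq. (10): accumulate shot lengths
    of both videos and record every x whose running sum agrees with some
    running sum of the other video within T_j seconds. Returns the
    recomputed rho+ and the number of such temporary matches."""
    sum_A = [0.0]
    for l in lengths_A:
        sum_A.append(sum_A[-1] + l)
    sum_B = [0.0]
    for l in lengths_B:
        sum_B.append(sum_B[-1] + l)
    matched_len, n_matches = 0.0, 0
    for x in range(1, len(sum_A)):
        if any(abs(sum_A[x] - sum_B[y]) < T_j for y in range(1, len(sum_B))):
            matched_len = sum_A[x]        # largest x with a temporary match
            n_matches += 1
    total = sum_A[-1]
    rho_plus = (2 * matched_len - total) / total if total > 0 else -1.0
    return rho_plus, n_matches
```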

3 Experimental results

The proposed scheme extracts scene-change frames, from which the key frames are selected to form the spatial features as an indexing tool. The lengths of shots then act as the temporal features for identifying matches more reliably. Therefore, the scene-change detection process certainly plays an important role in the proposed scheme. We first use the videos from MUSCLE (MUSCLE-VCD2007) [13] to demonstrate the effectiveness of the proposed scene-change detection method. MUSCLE is dedicated to the evaluation of content-based video copy detection systems. The tested video set of MUSCLE contains around 60 hours of video materials from different sources, including web video clips, TV archives and movies. Various contents are included, such as documentaries, movies, sports events, TV shows and cartoons. Twenty videos are included in this test and the true scene changes are identified by inspection. The results are demonstrated in Table 1, which lists the numbers of correct detections (TP, True Positive), false detections (FP, False Positive) and misses (FN, False Negative). The precision rate is calculated by \(P=\frac {TP}{TP+FP}\) and the recall rate is computed according to \(R=\frac {TP}{TP+FN}\). The average recall rate is 97.64 % and the average precision rate is 92.83 %. Some misses occur when the colors in adjacent scenes are similar, especially in the videos with lower quality. A few test videos have black frames inserted, which may leave the true scene changes with fading effects undetected. Similarly, many false detections happen around fading effects, where multiple detections are made. Although the precision rate can still be improved, the proposed scheme tends to identify more scene changes than expected to cope with various types of videos. It should be noted that the accurate recognition of clear or sharp scene changes is more important in the proposed scheme. Although this strategy of detecting more scene changes may affect the accuracy of temporal features, the adjustment procedure shown in Fig. 6 or (10) is very helpful in compensating for the deficiency and making the temporal features still robust enough to find correct matches. This step of temporal-feature adjustment also alleviates the possible weakness of the proposed scheme in dealing with smooth scene changes.
Table 1

The performance of scene-change detection

Video   Correct detections   False detections   Misses   Recall (%)   Precision (%)
1        27                   3                  0       100.00        90.00
2       122                  18                  4        96.83        87.14
3        56                   9                  0       100.00        86.15
4       105                  18                  3        97.22        85.37
5        88                   5                 10        89.80        94.62
6       152                  28                  5        96.82        84.44
7       235                  27                 16        93.63        89.69
8       313                  50                  5        98.43        86.22
9       177                   2                  2        98.88        98.88
10      236                   8                  2        99.16        96.72
11      231                   8                 16        93.52        96.65
12      434                  33                  3        99.31        92.93
13      120                   4                  1        99.17        96.77
14       67                   4                  3        95.71        94.37
15      195                   5                  1        99.49        97.50
16        3                   0                  0       100.00       100.00
17       47                   1                  0       100.00        97.92
18       90                  10                  0       100.00        90.00
19      533                  30                  5        99.07        94.67
20      290                   9                  2        99.32        96.99

Figure 7 shows an example from Video 6 in Table 1, which has the worst precision rate among the twenty test videos. 152 scene-change frames are identified and 28 of them are considered false detections, which are marked with a green ⊗ at the top-left corner. The false detections usually come from large camera motions and the frames with fading effects. The scene-change frames that are chosen as the key frames are marked with a red ⊕ at the bottom-right corner. In this example, some scene-change frames are excluded from being considered as key frames mainly because of too many “don’t care” bits existing in the indices or the lack of significant textures as determined by the eigenvalues in SVD. The other example (Video 3) is shown in Fig. 8. Nine false detections are identified and 24 out of 56 scene-change frames are selected as the key frames. We can see that many scene-change frames, such as the face at the beginning of the video, are not taken into further consideration because the contents are similar to others and deemed less representative.
Fig. 7

An example of scene-change and key frame selection

Fig. 8

The other example of scene-change and key frame selection

Next, we would like to demonstrate the feasibility of the proposed scheme by examining the performance of copy detection. The feature database is formed by extracting the associated features from a large number of collected videos. A few of these videos are then randomly selected to form many one-minute video segments as the investigated videos. The features of tested video segments will be matched with those stored in the feature database to see whether the segments are copies from the portions of videos in the collected set. The investigated video segments may be distorted to test the robustness of the proposed scheme. The evaluation method of information retrieval in [25] is employed to show the performance with a single indicator, F(β), calculated by
$$ F(\beta)=(1+\beta^{2})\frac{P \times R}{\beta^{2}\times P +R}, $$
(11)
where P and R are precision and recall rates defined before. A larger F(β) means that better robustness and discrimination are both achieved. β is within [0, 1] and 0.5 is chosen such that P is twice as important as R. In addition to the videos from MUSCLE [13], the videos from ReefVid (reefvid.org) [20] are also used for testing the proposed framework. ReefVid videos, which present the scenery of coral reefs, are the same test videos used in the two existing methods, i.e., 3D-DCT by Coskun et al. [3] and TIRI(Temporally Informative Representative Images)-DCT by Esmaeili et al. [6]. Therefore, a more objective comparison can be made. It is worth noting that the indicator F(β) in (11) is also adopted in [3] and [6]. 200 video clips, each of which is longer than one minute, are selected to form the tested video set. Six types of distortions, including adding noises, adjusting the brightness, rotating the frames, temporal shifting, spatial shifting and frame loss, are then applied and their parameters are listed in Table 2. The results are shown in Table 3, from which we can see that the F-scores under these six distortions in both MUSCLE and ReefVid videos are close to one so the proposed scheme has decent performances. The recall rate or True Positive Rate (TPR) and False Positive Rate (FPR) defined as \(\frac {FP}{FP+TN}\) are also listed in Table 3. The good performances against such spatial attacks as noise adding, brightness adjustment, rotation and spatial shifting show the robustness of the proposed spatial features. The content-adaptive nature helps the scheme to perform excellently under the attack of temporal shifting.
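For reference, (11) can be computed directly from the raw detection counts, as in the small helper below.

```python
def f_score(tp, fp, fn, beta=0.5):
    """Eq. (11): F(beta) from precision P and recall R; beta = 0.5 weights
    precision twice as much as recall, as chosen in the experiments."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```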
Table 2

The parameters of attacks used in the evaluation

Distortion           Effect                                   Min    Max
Noises (σ)           l'_(m,n,k) = l_(m,n,k) + G(0, σ)         10     70
Brightness (b)       l'_(m,n,k) = l_(m,n,k) + b × μ_k         −0.7   0.7
Rotate (ν)           Rotates the frame by ν degrees           −5     5
Temporal shift (δ)   Video is shifted by δ seconds            −0.5   0.5
Spatial shift (s)    Shift the frame s% right and s% down     −4     4
Frame loss (d)       d% of the frames are randomly dropped    0      10

Table 3

The performances of copy detection by using ReefVid and MUSCLE videos

                  TPR (%)               FPR (%)               F-score
Distortion        ReefVid    MUSCLE     ReefVid    MUSCLE     ReefVid    MUSCLE
Noises            98.10      92.57      3.56       0.00       0.99       0.98
Brightness        98.22      98.71      0.23       0.00       1.00       1.00
Rotate            95.57      95.37      1.09       0.00       0.99       0.99
Temporal shift    99.09      100.00     0.02       0.00       1.00       1.00
Spatial shift     99.89      97.20      1.20       0.00       0.99       0.99
Frame drop        90.36      98.40      2.58       0.00       0.98       0.99
Average           96.87      97.04      1.45       0.00       0.99       0.99

However, the proposed method may be more vulnerable to the attack of serious frame loss, which may alter the temporal features severely. It should be noted that the scheme can survive mild frame loss pretty well since flexibility has been provided in (9). Under more serious frame loss attacks, which may make the video almost unwatchable, the estimation of the frame loss rate will be required. According to our experience, when the frame loss rate is lower than 10 %, the lengths of many short scenes will still be kept. Therefore, when calculating the difference function of scene lengths in (9), we also keep track of the histogram of scene lengths. If the response in (8) is not large enough to claim a match but we observe a few similar lengths of scenes during the matching process, these close values will be viewed as the anchor points so that the frame loss rate can be estimated. To be more specific, two close shot lengths in the investigated video and a tested video in the database will trigger the adjustment for the possible frame loss to evaluate the frame loss rate. After adjusting the temporal features accordingly, the proposed scheme can resist such attacks well, as shown in Table 3. This method may fail when the frame loss rate is higher than 10 %, which, in our opinion, is of very limited practical concern though. It is also worth noting that the false positive rates in ReefVid videos are slightly higher because these underwater videos do not usually contain clear scene changes and the adjustment of temporal features in (10) may generate some false alarms. For the videos from MUSCLE, which are closer to regular videos with more scene changes, the robust temporal features in the proposed scheme can identify the content correctly and the false positive rates are equal to zero in our tests.
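The paper does not give an explicit formula for this estimation, so the helper below is only a hypothetical sketch of how the loss rate could be derived from matched anchor scenes and then used to rescale the temporal features before rerunning the match.

```python
def estimate_frame_loss_rate(test_anchor_gap, original_anchor_gap):
    """Hypothetical helper (our construction, not spelled out in the paper):
    given the accumulated durations between two matched anchor scenes in the
    investigated video and in the database video, approximate the frame-loss
    rate from their ratio; the temporal features of the test video could then
    be stretched by 1 / (1 - rate) before re-running the matching."""
    if original_anchor_gap <= 0:
        return 0.0
    rate = 1.0 - test_anchor_gap / original_anchor_gap
    return min(max(rate, 0.0), 1.0)
```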

The details of the experimental results, along with the comparisons with 3D-DCT [3] and TIRI-DCT [6], are demonstrated in Fig. 9. Four lines are shown in each case, including testing the MUSCLE videos by the proposed scheme (△) and testing the ReefVid videos by the proposed scheme (∘), 3D-DCT (∗) and TIRI-DCT (\(\square \)) respectively. The Y-axis shows the F-scores while the X-axis lists the parameters of attacks. As expected, larger distortions decrease the corresponding F-scores and all the three schemes perform reasonably well to meet the requirements of video copy detection. The F-scores of the proposed scheme are above 0.95 even when more serious distortions are tested. Despite the similar performances, the major contributions and advantages of the proposed method are the compact size of the feature database and the execution efficiency. The proposed scheme is compared with TIRI-DCT, which improves 3D-DCT with better feasibility, although 3D-DCT outperforms TIRI-DCT slightly in Fig. 9. For the MUSCLE videos, the size of the features in the proposed scheme is 471 KB, which is around 12 % of that in TIRI-DCT (3945 KB). In the ReefVid videos, the feature size of the proposed scheme is around 10 % of the size in TIRI-DCT as these videos have fewer scene changes so a smaller volume of features will be archived. The advantages of the proposed scheme over TIRI-DCT are also reflected from the time to construct the database. The detailed comparisons of the proposed method and TIRI-DCT by using 200 ReefVid video clips are shown in Table 4. We implemented TIRI-DCT with four different sizes of video segments specified in [6] by using C/C++. The code is run on a machine with Intel Core2 Duo 2.83GHz CPU and 8G RAM. The time to prepare the features in the proposed scheme is 12 to 26 % of the time of TIRI-DCT and the feature size is only 5 to 17 %. It should be noted that, although Fig. 9 seems to show that 3D-DCT is the best method and TIRI-DCT outperforms the proposed scheme, the differences are actually very limited if we check the F-scores in Fig. 9. However, considering the feasibility in terms of time to form the feature database and the size of features, the proposed scheme certainly has an edge over TIRI-DCT, which is an improved method of 3D-DCT.
Fig. 9

The comparisons of performances for a noise adding, b brightness adjusting, c rotation, d temporal shifting, e spatial shifting and f frame loss

Table 4

The comparison with TIRI-DCT using 200 ReefVid video clips

Method          TIRI-DCT (126 bits)                      The proposed
                1 sec     2 sec     3 sec     4 sec      scheme
Time (msec.)    62531     46970     34696     28137      7425
Size (KBytes)   442       241       160       118        22

It is worth noting that limiting the feature size is quite important given the fact that the number of videos considered in this application is very large. Both 3D-DCT and TIRI-DCT store the content information every few seconds. This strategy certainly benefits the content matching without the need of dealing with temporal-based attacks, but the feature size cannot be controlled easily for longer videos. In real-world videos with meaningful content, many scene changes are expected and certain frames should be more important or representative, so storing these data only may achieve similar functionalities with a much smaller storage requirement. In addition to the advantage of less time spent on constructing the feature database, the proposed scheme is considered more efficient since it is built on H.264/AVC, so the need for transcoding, which is a required pre-processing step in both 3D-DCT and TIRI-DCT, can be reduced. Given a video compressed in H.264/AVC, the scene-change detection is applied in the compressed domain so the temporal features can be acquired conveniently. Besides, if the encoding process adopts a flexible GOP structure, most of the scene-change frames will be encoded as I frames so the key frames can be directly decoded to calculate the spatial features. Therefore, investigating a given video segment can also be carried out in a very efficient manner.

We then build a larger video database for testing more practical cases. Six video sets are shown in Table 5. The list of 87 videos of the set “Video-Comparer” can be found in [2], which includes cartoons, old movies, movie trailers and sports videos, with a total length of 36 hours, 39 minutes and 22 seconds (36:39:22). Considering that talk show videos, which contain many similar scenes, may cause trouble in this copy detection application, fifty “David Letterman Show” video clips with a total duration of more than nine hours are collected from Youtube for testing. For “Kangxi Coming”, a Mandarin talk show, we are able to find 100 full programs with more than 74 hours on Youtube, which is considered a more general case. “Empresses in the Palace” is a Chinese TV drama, from which we collect 76 full programs with more than 50 hours. We also collect 200 miscellaneous videos from Youtube with a length of 31 hours, 27 minutes and 28 seconds (31:27:28). The database is then formed by these five video sets plus the MUSCLE videos (58:44:02), so the total length of the videos in the database is more than 250 hours. The size of the spatial/temporal features is 4113 KB and the time to construct this feature set is 1464 seconds.
Table 5

The performances of copy detection by using general videos

Video set             Video-     David       Kangxi     Empresses in   MUSCLE     Miscellaneous
                      comparer   letterman   coming     the palace     videos     videos
Number of videos      87         50          100        76             100        200
Duration              36:39:22   9:35:27     74:39:50   50:39:03       58:44:02   31:27:28
Examined key frames   19990      914         6509       5442           2603       14235
Positives             9453       616         3393       2799           2581       6849
Negatives             10537      298         3116       2643           22         7386
  Spatial features    9409       41          1179       2337           20         6154
  Temporal features   1128       257         1937       306            2          1232
False positives       0          0           0          0              0          0
False negatives       1          0           0          0              0          0
Match time (sec.)     241.2      8.9         274.4      46.3           41.1       175.6
Key frame (msec.)     12.1       9.7         42.2       8.5            15.8       12.3

In the tests, the resolution of the original videos processed for the database is 640 × 360. The tested or investigated videos are low-resolution (426 × 240) versions, which are also downloaded directly from Youtube. For the MUSCLE videos, we transcode them by ourselves with the original CIF resolution at a lower bit-rate. The row “Examined key frames” in Table 5 lists the numbers of checked key frames in the investigated videos. We can see that these values are smaller in such static videos as talk shows. The row “Positives” indicates the numbers of matched spatial/temporal features among these tested key frames. Each match can clearly signal a found copy since the combination of spatial and temporal features can yield robust results. The row “Negatives” lists the numbers of cases in which the key frames from the transcoded investigated videos cannot find corresponding matches in the database. These values can be divided into two parts, i.e., the numbers of rejections by the “Spatial features” and those by the “Temporal features”, as shown in the following two rows of Table 5. In general cases such as the TV drama, the spatial features can help to reject most of the incorrect key frames or detections. Nevertheless, as mentioned before, many scenes in such programs as talk shows are similar or even the same, so we have to resort to the temporal features to find the exact matches. The false positives and negatives are rare in the tests. There is only one miss found in the video set “Video-Comparer”. A miss or false negative means that we cannot find any match of spatial/temporal features in the entire investigated video, whose features should have been stored in the database. In fact, the 50-minute video that is not detected is a segment of an old movie (Seven Sinners, 1936), which has a very poor quality. The bit-rate of the transcoded video is only 96 Kbps, so the scene-change detection and the subsequent key frame selection are affected. Furthermore, although a few key frames are correctly identified, the spatial features are not reliable enough to claim a match. Therefore, no temporal feature is extracted for further comparison and a miss happens. The row “Match time” lists the time spent on testing these videos, which is directly related to the number of key frames and the length of the video set. The last row shows the matching time divided by the number of key frames as a reference. It should be noted that, in practical applications, the detection process can be stopped once a match is found in an investigated video. In the proposed method, as long as the investigated videos (and the original videos) are H.264/AVC video streams, the processing can be very efficient since the scheme is implemented under H.264/AVC and no additional transcoding is necessary.

Instead of using the entire transcoded videos as the investigated ones, we randomly crop 120 video segments, each of which is one minute long, from the six video sets to simulate searching for a short video in a very large database. The results are demonstrated in Table 6. There is no false positive detection, as expected, since the temporal features are very reliable. However, the results do show one possible weakness of using key frames, as we can observe some misses in the row “False negatives” of Table 6, especially in the talk show videos. The reason for the misses is that the key frames stored in the database are not included in the cropped video segments. Possible solutions to tackle this issue are taking the scene-change frames in longer shots into account and adjusting the thresholds of comparing the differences between scene-change frames to increase the number of key frames.
Table 6

The performances of copy detection by using video segments

Video set             Video-     David       Kangxi     Empresses in   MUSCLE     Miscellaneous
                      comparer   letterman   coming     the palace     videos     videos
Key frames            1182       226         246        371            151        934
Positives             581        43          94         177            125        493
Negatives             601        183         152        194            14         441
  Spatial features    539        33          33         176            12         412
  Temporal features   62         119         89         18             2          29
False positives       0          0           0          0              0          0
False negatives       5          8           9          3              2          4
Match time (sec.)     12.7       1.8         7.6        2.4            1.8        10.3
Key frame (msec.)     10.7       8.0         30.9       6.5            11.9       11.0

Finally, since there are many different uploaded versions of the programs, “Kangxi Coming” and “Empresses in the Palace”, we also collect many video clips of these two TV series from different sources in Youtube as the investigated videos to test the proposed method. Although the uploaded videos may have different resolutions and formats, as long as they have a reasonable frame quality and are long enough, the proposed scheme has no problem in matching the data from most of the test videos. A few issues that may affect the detection results are as follows. First, the investigated video may undergo serious cropping, resulting in a different aspect ratio. Since the proposed method employs a global spatial feature by scaling the entire frame into a fixed-size block, the large scale cropping may cause problems in the spatial feature matching as the spatial features will be different from those extracted from the full frames. Second, if the test videos are not H.264/AVC streams, we need to apply the transcoding, which may cause certain deviations in the temporal features. This is another reason why T d in (9) is introduced to increase the resilience of the temporal feature matching. Third, subtitles and logos are usually added in most of the uploaded videos. Such editing is seldom an important factor as the affected areas on frames are usually small. Nevertheless, if the logo is superimposed like a large border around the frame, the spatial feature may be affected, especially when the detection of active areas in the proposed scheme fails to remove these regions. Besides, the removal of the borders may be equivalent to cropping in certain uploaded videos. Fourth, some video clips are even edited versions. Sometimes, like movie trailers, black frames are inserted in the locations of scene changes and these frames will change the temporal features. If such edited video clips are also the targets to be matched, removing black frames before the temporal feature matching may be a helpful step.

It is worth noting that there are several empirically set thresholds in the proposed scheme, as listed in Section 2, which may affect the performance of content matching in different manners. In fact, setting these thresholds in an optimized way is quite challenging since a very large number of videos of various kinds have to be tested. We adopt a simple approach by assigning them a few common values, i.e., 0.25, 0.5 or 0.75, so that the design of the proposed scheme can be followed more easily and, most importantly, satisfactory results can be achieved by such settings. More careful tuning of these threshold values would certainly help to further enhance the performance of the proposed method.

The proposed scheme has a weakness in that the generation of temporal features fails in videos without any scene changes. Considering the commercial value and feasibility of popular videos, we exclude such videos from the proposed design. If the detection/investigation of these static videos is required, or the videos are precious documentary films to be archived, one solution is to forcibly insert or claim a scene change at the frame that has the largest variation within a fixed long period of time in which no scene change has been detected. Since this strategy may have negative effects on common videos, video classification may have to be applied to process different types of videos in different ways. In addition, the conditions of classifying videos have to be as consistent as possible so that the original videos and the investigated one will be processed with the same procedures.

4 Conclusion

An efficient content-based video copy detection scheme based on the spatial and temporal feature extraction and matching is proposed in this research. The key-frames are selected to generate the spatial features, which are used as the anchor points for the temporal feature matching. The design considers the video coding structure so the efficiency and the compact size of feature database are the main contributions of the proposed framework. The experimental results also show that the extracted features can facilitate fast content matching for identifying the possible copies. The proposed method can thus be feasible in matching contents in very large video databases. Many uploaded videos may be different episodes of the same series of program and contain similar scenes. Therefore, the combination of spatial and temporal features is a very suitable approach to identify such cases in this copy detection application.


Acknowledgments

This research is supported by the Ministry of Science and Technology in Taiwan, ROC, under Grants MOST 103-2221-E-008-080 and MOST 104-2221-E-008-075.

References

  1. Awad G, Over P, Kraaij W (2014) Content-based video copy detection benchmarking at TRECVID. ACM Trans Inf Syst:32
  2. Benchmark videos from Youtube [Online]. Available: http://www.video-comparer.com/product-benchmark-youtube-list.php
  3. Coskun B, Sankur B, Memon N (2006) Spatio-temporal transform based video hashing. IEEE Trans Multimedia 8:1190–1208
  4. Cotsaces C, Nikolaidis N, Pitas I (2006) Shot detection and condensed representation—a review. IEEE Signal Process Mag 23:28–37
  5. De Roover C, De Vleeschouwer C, Lefebvre F, Macq B (2005) Robust video hashing based on radial projections of key frames. IEEE Trans Signal Process 53(10):4020–4037
  6. Esmaeili MM, Fatourechi M, Ward RK (2011) A robust and fast video copy detection system using content-based fingerprinting. IEEE Trans Inf Forensics Secur 6:213–226
  7. Ferman A, Tekalp M, Mehrotra R (2002) Robust color histogram descriptors for video segment retrieval and identification. IEEE Trans Multimedia 11:497–508
  8. Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer Academic Publishers
  9. Hsu W, Chua TS, Pung HK (1995) An integrated color-spatial approach to content-based image retrieval. In: Proceedings of ACM Multimedia
  10. Kang L-W, Hsu C-Y, Chen H-W, Lu C-S (2010) Secure SIFT-based sparse representation for image copy detection and recognition. In: IEEE International Conference on Multimedia and Expo, pp 1248–1253
  11. Kashino K, Kurozumi T, Murase H (2003) A quick search method for audio and video signals based on histogram pruning. IEEE Trans Multimedia 5(3):348–357
  12. Katsavounidis I, Kuo C-C, Zhang Z (1994) A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters 1(10):144–146
  13. Law-To J, Joly A, Boujemaa N (2007) Muscle-VCD-2007: a live benchmark for video copy detection. http://www-rocq.inria.fr/imedia/civr-bench/
  14. Ling H, Cheng H, Ma Q, Zou F, Yan W (2012) Efficient image copy detection using multiscale fingerprints. IEEE MultiMedia 19:60–69
  15. Liu H, Lu H, Xue X (2013) A segmentation and graph-based video sequence matching method for video copy detection. IEEE Trans Knowl Data Eng 25:1706–1718
  16. Liu J, Huang Z, Cai H, Shen HT, Ngo CW, Wang W (2013) Near-duplicate video retrieval: current research and future trends. ACM Comput Surv:45
  17. Liu T, Zhang H-J, Qi F (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans Circuits Syst Video Technol 13:1006–1013
  18. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110
  19. Lu S, Wang Z, Mei T, Guan G, Feng D (2014) A bag-of-importance model with locality-constrained coding based feature learning for video summarization. IEEE Trans Multimedia 16:1497–1509
  20. ReefVid: a resource of free coral reef video clips for educational use [Online]. Available: http://www.reefvid.org
  21. Song J, Yang Y, Huang Z, Shen HT, Luo J (2013) Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Trans Multimedia 15:1997–2008
  22. Su P-C, Chen C-C, Chang H-M (2009) Towards effective content authentication for digital videos by employing feature extraction and quantization. IEEE Trans Circuits Syst Video Technol 19:668–677
  23. Swain M, Ballard D (1991) Color indexing. Int J Comput Vis:7
  24. Tan YP, Saur DD, Kulkarni SR, Ramadge PJ (2000) Rapid estimation of camera motion from compressed video with application to video annotation. IEEE Trans Circuits Syst Video Technol:133–145
  25. van Rijsbergen CJ (1979) Information retrieval. Butterworth-Heinemann, London
  26. Wang T, Yu W, Chen L (2007) An approach to video key-frame extraction based on rough set. In: International Conference on Multimedia and Ubiquitous Engineering (MUE ’07), pp 590–596
  27. Wolf W (1996) Key frame selection by motion analysis. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp 1228–1231
  28. Wu P-H, Thaipanich T, Jay Kuo C-C (2009) A suffix array approach to video copy detection in video sharing social networks. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, Taiwan
  29. Zargari F, Mehrabi M, Ghanbari M (2010) Compressed domain texture based visual information retrieval method for I-frame coded pictures. IEEE Trans Consum Electron 56:728–736
  30. Zhou X, Zhou X, Chen L, Bouguettaya A, Xiao N, Taylor JA (2009) An efficient near-duplicate video shot detection method using shot-based interest points. IEEE Trans Multimedia 11:879–891
  31. Zhuang Y, Rui Y, Huang TS, Mehrotra S (1998) Adaptive key frame extraction using unsupervised clustering. In: Proceedings of IEEE International Conference on Image Processing, pp 866–870

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Computer Science and Information Engineering, National Central University, Jhongli, Republic of China
