
Shot Boundary Detection and Key Frame Extraction for Sports Video Summarization Based on Spectral Entropy and Mutual Information

Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 221)

Abstract

Video summarization methods attempt to abstract the main occurrences, scenes, or objects in a clip in order to provide an easily interpreted synopsis of the video. This is an essential task in video analysis and indexing applications. In this work, new methods are proposed for detecting shot boundaries in video sequences and extracting key frames using metrics based on information theory. The shot cut detection method relies on the mutual information between frames. The key frame extraction method uses the difference between the entropy values computed from the eigenvalue matrices of consecutive frames to decide which frames to choose as key frames. The proposed method captures the visual content of the shot satisfactorily. The information-theoretic measure provides better results because it exploits the inter-frame information in a compact way, and it compares favorably with other methods published in the literature. Computing the entropy measure on the eigenvalues of frames reduces the computational complexity. The proposed algorithm can capture the important yet salient content as the key frames, and its robustness and adaptability are validated by experiments with various kinds of video sequences.

Keywords

Video summarization · Dynamic key frame extraction · Mutual information · Entropy difference measure · Information theory

1 Introduction

Developments in software tools over the last few years have made myriad applications in areas such as multimedia databases feasible. The incredible rates at which these databases are being published have far exceeded the capabilities of current text-based cataloguing. New techniques, approaches, and quick search algorithms have increased the potential of media databases, which now contain not only text and images but video and audio as well. Extensive research efforts have addressed the retrieval of video and image data based on visual content such as color distribution, texture, and shape. These approaches are mainly based on feature similarity measurement. The proposed work is an attempt to use information-theoretic parameters, and similarity expressed in terms of these parameters, to represent visual content and its changes, in order to achieve shot boundary detection and key frame extraction for the summarization of sports videos.

The summarization, indexing and retrieval of digital video is an active research area. Shot boundary detection and key frame extraction are important tasks in analyzing the content of video sequences for indexing, browsing, searching, summarizing and other content-based operations on large video databases.

The video shot is a basic structural building block of a video sequence, and its boundaries need to be determined, preferably automatically, to allow content-based video abstraction. A video shot may be defined as a sequence of frames captured by one camera in a single continuous action in time and space [1]; it is a group of frames with consistent visual characteristics (including color, texture and motion). After shots are segmented, key frames can be extracted from each shot. A key frame is a frame that represents the salient content of the shot; depending on the complexity of the shot's content, one or more key frames can be extracted [2].

Key frames are still images which best represent the content of a video sequence in an abstracted manner, and may be either extracted or reconstructed from the original video data. Key frames are frequently used to supplement the text of a video log, but there has been little progress in identifying them automatically. The challenge is that key frame extraction must be automatic and content-based, so that the key frames maintain the important content of the video while removing all redundancy.

Key frames provide a suitable abstraction and framework for video indexing, browsing and retrieval. Their use greatly reduces the amount of data required in video indexing and provides an organizational framework for dealing with video content. Much research has been done on key frame extraction. The simplest methods choose only one frame for each shot, usually the first one, regardless of the complexity of the visual content. More complicated approaches take into account visual content, motion analysis and shot activity [3]. These approaches either cannot effectively capture the major visual content or are computationally expensive.

Key-frame-based video representation views video abstraction as a problem of mapping an entire video segment to a small number of representative frames. The extraction of key frames must be automatic and content-based so that they maintain the salient content of the video while avoiding redundancy. Many of the systems described in the literature use a constant number of key frames taken from fixed positions in the shot, or several frames separated by a fixed distance. Better key frames could be chosen if shot content were considered in the selection.

In this work, new approaches are proposed for shot boundary detection based on mutual information, and for key frame extraction based on the difference in entropy of the eigenvalue matrices of consecutive frames. Mutual information measures the information passed from one frame to another and is used here for detecting abrupt cuts, where the image intensity or color changes abruptly: a large difference in content between two frames indicates weak inter-frame dependency and leads to a small mutual information value. A change in entropy beyond a certain threshold between any pair of consecutive frames indicates movement from one significant segment of a shot to another; representative key frames are then extracted from both segments to abstract the shot, leading to a summary of the entire video.

The proposed work first detects shot boundaries and divides the video into meaningful units by exploiting the capability of mutual information to differentiate frames by their content. The algorithm then detects points of significant change within a shot by tracking the entropy value of each frame. The threshold is defined so that every significant change in the visual content within a shot is picked up, and key frames are extracted at each such point. Because the overall algorithm extracts key frames from all parts of the video, a smooth summary of the complete video can be constructed. The experiments also reveal that the performance of this algorithm is comparable with many recent algorithms that are agreed to provide good video summaries.

The remainder of the paper is organized into six sections. Section 2 presents related work and the background of the algorithms used. Section 3 describes the newly proposed algorithm. Section 4 presents the algorithm developed for shot boundary detection. Section 5 discusses the key frame extraction scheme. Section 6 discusses the results obtained, and Sect. 7 concludes.

2 Related Work

Recent work on story-based video summarization can be classified into three categories according to the method of key frame extraction: sampling based, shot based, and segment based. In sampling-based approaches, key frames are extracted by randomly choosing frames sampled from the original video. This is the most straightforward way to extract key frames, yet it may fail to capture the real video content, especially when the content is highly dynamic. In shot-based approaches, the video is segmented into separate shots and one or more key frames are extracted from each shot. A sequence of frames captured by one camera in a single continuous action in time and space is referred to as a video shot [4]; normally it is a group of frames with constant visual attributes (such as color, texture, and motion). Extracting key frames per shot is a more meaningful method that adapts to dynamic video content. A typical and easy approach within this category utilizes low-level features such as color and motion to extract key frames; more complicated systems are based on color clustering, global motion, or gesture analysis [5, 6]. In segment-based approaches, video segments are extracted by clustering frames, and the key frames are chosen as the frames closest to the centroid of each computed cluster. Systems based on this approach can be found in [7, 8, 9].

Even though shot boundary detection and key frame extraction are strongly related, the two problems have usually been addressed separately: often a boundary detection algorithm is first used to detect the shots, followed by key frame extraction. In [10], the first frame of each shot is selected as the key frame. The work in [11] assumes that the shot boundaries have already been detected and uses an unsupervised clustering algorithm to find the key frame of a given shot. Another approach is to compare consecutive frames: in [12], the difference between the color histograms of consecutive frames is compared with a threshold to obtain key frames.

Toward the more complicated end, techniques based on clustering have also been developed. The technique in [13] clusters visually similar frames and uses constraints on key frame positions in the clustering process to select key frames. In the algorithm described in [14], the video is down-sampled to decrease the number of frames and features are extracted from the frames; the feature space is refined by applying SVD, the feature points are clustered, and a key frame is then extracted from each cluster.

Different methods can be used to select key frames. In general these methods assume that the video has already been segmented into shots (continuous sequences of frames taken over a short period of time) by a shot detection algorithm, and extract the key frames from within each shot. One possible approach is to take the first frame of the shot as the key frame [15]. A few methods use the first and last frames of each shot [16, 17]. Other approaches time-sample the shots at predefined intervals, as in [18], where the key frames are taken from a set location within the shot, or, alternatively, the video is time-sampled regardless of shot boundaries and frames at regular intervals are picked as key frames [19]. These approaches do not consider the dynamics of the visual content of the shot but rely only on the sampled sequences and boundaries, and they often extract a fixed number of key frames per shot. In [20] only one key frame is extracted from each shot: the frames are segmented into objects and background, and the frame with the maximum ratio of objects to background is chosen as the key frame, since it is assumed to convey the most information about the shot. In the work proposed in [21] the key frames are extracted based on a compression policy. The algorithm proposed in [22] divides frames into clusters and selects key frames from the largest clusters only. In [23] constraints on the temporal position of the key frames are used in the clustering process; a hierarchical clustering reduction is performed, yielding summaries at different levels of abstraction.

In order to take into account the visual dynamics of the frames within a sequence, some approaches compute the differences between pairs of frames (not necessarily consecutive) in terms of color histograms, motion, or other visual descriptors, and select the key frames by analyzing the resulting values. A simple method is to select a frame as a key frame if its color histogram differs from that of the previous frame by more than a given threshold. In [24] frame differences are used to build a "content development" curve that is approximated, using an error minimization algorithm, by a curve composed of a predefined number of rectangles. The method in [25] proposes a very simple approach: the key frames are selected by an adaptive temporal sampling algorithm that uniformly samples the y-axis of the curve of cumulative frame differences; the resulting non-uniform sampling of the curve's x-axis yields the set of key frames.

The compressed domain is often considered when developing key frame extraction algorithms, since it allows the dynamics of a video sequence to be expressed easily through motion analysis. A neural network approach using motion intensities computed from MPEG compressed video is presented in [26]. A fuzzy system classifies the motion intensities into five categories, and frames that exhibit high intensity are chosen as key frames [27]. In [28], perceived motion energy (PME) computed on the motion vectors is used to describe the video content; a triangle model of motion patterns is then employed to extract key frames at the turning points of acceleration and deceleration.

The drawback of most of these approaches is that the number of representative frames must be fixed in advance in some manner depending on the length of the video shots, as in [29]; this cannot guarantee that the selected frames will not be highly correlated. It is also difficult to set a suitable interval of time or frames: small intervals mean a large number of frames will be chosen, while large intervals may not capture enough representative frames, and those chosen may not be in the right places to capture significant content. Still other approaches work only on compressed video, are threshold-dependent, or are computationally intensive (e.g. [29, 30]).

In this paper, a novel approach is proposed for selecting key frames that determines the complexity of the sequence in terms of changes in the pictorial content, using the entropy of the eigenvalues of each frame. Similarity between frames is computed as the difference in entropy, calculated from the eigenvalues of frames within an identified shot. The frame differences are then used to dynamically and rapidly select a variable number of key frames within each shot. The method works fast on all kinds of videos (compressed or not) and does not exhibit the complexity of some of the existing clustering-based methods. It can also output key frames while computing the frame differences, without having to process the whole shot.

The proposed mechanism of mutual-information-based shot boundary detection, followed by key frame extraction from the separated shots based on spectral entropy, is presented in the following sections.

3 The Proposed Mechanism for Shot Boundary Detection and Key Frame Extraction

The overall key frame extraction algorithm may be summarized as follows.

Step 1. Extract the individual frames from the video. Let \( F_{t} \) denote the t-th frame and \( F_{t + 1} \) its immediate successor.

Step 2. Calculate the mutual information \( I_{t,t + 1}^{R,G,B} \) between every pair of consecutive frames over the full length of the video:

$$ I_{t,t + 1}^{R} = \sum_{i = 0}^{N - 1} \sum_{j = 0}^{N - 1} C_{t,t + 1}^{R}(i,j)\,\log \frac{C_{t,t + 1}^{R}(i,j)}{C_{t,t + 1}^{R}(i)\,C_{t,t + 1}^{R}(j)} $$
$$ I_{t,t + 1} = I_{t,t + 1}^{R} + I_{t,t + 1}^{G} + I_{t,t + 1}^{B} $$

where \( C_{t,t + 1}^{R}(i,j) \) is the co-occurrence matrix between frames \( F_{t} \) and \( F_{t + 1} \) for the red component; \( I_{t,t + 1}^{R} \), \( I_{t,t + 1}^{G} \) and \( I_{t,t + 1}^{B} \) are the mutual information values of the red, green and blue components; and \( I_{t,t + 1} \) is the total mutual information between frames \( F_{t} \) and \( F_{t + 1} \).

Step 3. Consider a window of 100 frames; within the window, compare the mutual information values to a threshold. In this work the threshold is 0.30 × the mean of all mutual information values in the window.

Step 4. A shot boundary is detected between any two frames \( t, t + 1 \) if their mutual information \( I_{t,t + 1} \) is less than the threshold. (The detailed mechanism of shot boundary detection is presented in Sect. 4.)

Step 5. Consider all the frames within a shot and compute the eigenvalues of each frame:

$$ [V_{t}, D_{t}] = \mathrm{eig}(f_{t}) $$

where \( D_{t} \) is the diagonal matrix containing the eigenvalues of the frame image.

Step 6. For every frame within the shot, compute the entropy from the diagonal matrix \( D_{t} \) as

$$ E_{t} = \sum_{j = 0}^{n} e_{j} \log \frac{1}{e_{j}} $$

where \( E_{t} \) is the entropy of frame t and \( e_{j} \) is the j-th eigenvalue of frame t.

Step 7. Calculate the difference in entropy between consecutive frames from the first to the last frame of the shot. If the difference \( d_{t} = E_{t} - E_{t + 1} \) at any instant t satisfies the condition \( d_{t} > T_{k} \), where \( T_{k} = 2.0 \times \) the mean of the entropy values in the shot, select frame \( F_{t + 1} \) as a key frame.

Step 8. For every shot, also add the first frame as a key frame. In shots where the difference \( d_{t} \) never exceeds \( T_{k} \), the change in visual content is negligible and any frame can serve as a representative, so only the first frame is selected.

The proposed algorithm internally uses a shot boundary detection mechanism in which the mutual information between two successive frames is calculated separately for each of the RGB components. A small value of the mutual information It,t+1 indicates a high probability of a cut between frames ft and ft+1. The details of the approach are presented in the following section.

4 Shot Boundary Detection

The proposed approach to shot boundary detection is based on mutual information (MI), a measure of the information transported from one frame to the next. It is used for detecting abrupt cuts, where the image intensity or color changes abruptly: a large content difference between two frames indicates weak inter-frame dependency and therefore low MI. In this approach, the mutual information and the joint entropy between two successive frames are calculated separately for each of the RGB components. As the color intensity levels of the image sequence vary from 0 to N−1, at frame f t three \( N \times N \) matrices \( {\text{C}}^{R}_{t,t + 1} ,{\text{ C}}^{G}_{t,t + 1} \,{\text{and}}\,{\text{C}}^{B}_{t,t + 1} \) are created, which carry information on the intensity-level transitions between frames f t and f t+1.

In other words, considering only the R component, the entry \( {\text{C}}^{R}_{t,t + 1} (i,j) \), with 0 ≤ i ≤ N−1 and 0 ≤ j ≤ N−1, corresponds to the joint probability that a pixel with intensity level i in frame f t has intensity level j in frame f t+1; \( {\text{C}}^{R}_{t,t + 1} \) is thus a co-occurrence matrix between frames f t and f t+1. The mutual information \( I^{R}_{t,t + 1} \) of the transition from frame f t to frame f t+1 for the R component is expressed by Eq. (1), as presented in [30]:
$$ I_{t,t + 1}^{R} = \sum_{i = 0}^{N - 1} \sum_{j = 0}^{N - 1} C_{t,t + 1}^{R}(i,j)\,\log \frac{C_{t,t + 1}^{R}(i,j)}{C_{t,t + 1}^{R}(i)\,C_{t,t + 1}^{R}(j)} $$
(1)
and the total mutual information is given by Eq. (2):
$$ I_{t,t + 1} = I_{t,t + 1}^{R} + I_{t,t + 1}^{G} + I_{t,t + 1}^{B} $$
(2)
A small value of the mutual information I t,t+1, as shown in Fig. 1, indicates a high probability of a boundary between frames f t and f t+1.
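To make the computation concrete, the following Python/NumPy sketch estimates the co-occurrence probabilities of Eq. (1) with a joint intensity histogram and evaluates the per-channel and total mutual information of Eqs. (1) and (2). It is a minimal illustration, not the authors' Matlab implementation; the function names, the 256-level quantization, and the use of the natural logarithm are our assumptions.

```python
import numpy as np

def channel_mutual_information(a: np.ndarray, b: np.ndarray, levels: int = 256) -> float:
    """MI between one color channel of two consecutive frames (Eq. 1).

    a, b: 2-D uint8 arrays holding the same channel of frames f_t and f_t+1.
    The normalized joint histogram plays the role of the co-occurrence
    matrix C(i, j); its row and column sums give the marginals C(i), C(j).
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=levels,
                                 range=[[0, levels], [0, levels]])
    joint /= joint.sum()                  # joint probability C(i, j)
    pi = joint.sum(axis=1)                # marginal distribution of frame t
    pj = joint.sum(axis=0)                # marginal distribution of frame t+1
    nz = joint > 0                        # skip zero entries to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pi[:, None] * pj[None, :])[nz])))

def total_mutual_information(f_t: np.ndarray, f_t1: np.ndarray) -> float:
    """Eq. (2): sum of the R, G, B channel MI values (frames are H x W x 3)."""
    return sum(channel_mutual_information(f_t[..., c], f_t1[..., c]) for c in range(3))
```

In practice the intensity levels could be re-quantized to fewer bins to keep the N × N co-occurrence matrices small; the paper does not specify the quantization used.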
Fig. 1

Time series of the mutual information from a video sequence, showing detection of a shot boundary at a low mutual information value. X-axis: frame number; Y-axis: mutual information

In this context, boundary detection is essentially discontinuity detection in a one-dimensional signal, for which several threshold-based algorithms exist. In order to detect possible shot boundaries, an adaptive thresholding approach is employed in this work. The local mean of the mutual information values is computed over a one-dimensional temporal window W of size N W; in the proposed mechanism N W is a sequence of 100 frames of the video. A boundary is detected between a pair of frames f t and f t+1 if the mutual information between them falls below 30 % (an empirically chosen fraction) of the mean of all the mutual information values in that window of 100 values.
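The adaptive thresholding step can be sketched as below. The paper does not state whether the 100-frame window slides or tiles the sequence, so this sketch assumes non-overlapping windows; the function name and return convention are ours.

```python
import numpy as np

def detect_shot_boundaries(mi: np.ndarray, window: int = 100, ratio: float = 0.30) -> list[int]:
    """Adaptive thresholding of the 1-D mutual information signal.

    mi[t] holds I_{t,t+1} for consecutive frame pairs. A cut is declared
    between frames t and t+1 whenever mi[t] falls below `ratio` times the
    mean MI of its `window`-frame segment.
    """
    boundaries = []
    for start in range(0, len(mi), window):
        segment = mi[start:start + window]
        threshold = ratio * float(segment.mean())   # 0.30 x local MI mean
        boundaries.extend(start + t for t in range(len(segment)) if segment[t] < threshold)
    return boundaries
```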

5 Key Frame Extraction

Key frames are still images which best represent the content of the video sequence in an abstracted manner, and may be either extracted or reconstructed from the original video data. The proposed approach extracts key frames from each meaningful shot of the video, where the shots are obtained by identifying shot boundaries with the mutual-information-based algorithm. The key frame extraction method employs an entropy measure computed from the eigenvalues of the frames, which reduces computational complexity; this entropy measure exploits the inter-frame information flow in a compact way. The proposed algorithm can capture the important yet salient content as the key frames. The entropy of each frame is computed as the entropy of the eigenvalue matrix of the frame, obtained as in Eq. (3):
$$ [V_{t}, D_{t}] = \mathrm{eig}(f_{t}) $$
(3)
where \( V_{t} \) is the eigenvector matrix of frame t and \( D_{t} \) is the diagonal matrix containing the eigenvalues of frame t.
The entropy is then calculated for each frame as the entropy of the matrix \( D_{t} \), as shown in Eq. (4):
$$ E_{t} = \sum_{j = 0}^{n} e_{j} \log \frac{1}{e_{j}} $$
(4)
where \( E_{t} \) is the entropy of frame t and \( e_{j} \) is the j-th eigenvalue of frame t.

The proposed approach uses the entropy values calculated by Eq. (4): the difference in entropy between two consecutive frames provides information about the content change between them within the shot [30]. Let a video shot be \( S = \{f_{1}, f_{2}, f_{3}, \ldots, f_{t}\} \), where \( f_{1}, f_{2}, f_{3}, \ldots, f_{t} \) are the individual frames of the shot S obtained by the proposed shot boundary detection method, and let the entropy values in this shot be \( E = \{E_{1}, E_{2}, E_{3}, \ldots, E_{t}\} \). In order to find whether the content of the shot changes significantly, the difference \( d_{t} = E_{t} - E_{t + 1} \) of consecutive entropy values is calculated and compared with a predefined threshold \( T_{k} \); in the proposed work \( T_{k} = 2.0 \times \) the mean of the entropy values in the shot.
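A per-frame entropy computation along the lines of Eq. (4) might look like the sketch below. Eq. (4) leaves the treatment of complex or zero eigenvalues and any normalization implicit, so the use of eigenvalue magnitudes normalized to sum to one is our assumption, as is the restriction to square grayscale frames.

```python
import numpy as np

def frame_entropy(frame: np.ndarray) -> float:
    """Spectral entropy E_t of one frame (Eq. 4), from its eigenvalues.

    `frame` is assumed to be a square grayscale image matrix; eigenvalue
    decomposition requires a square matrix, so non-square frames would first
    need cropping or resizing (or singular values could be used instead).
    """
    eigvals = np.linalg.eigvals(frame.astype(np.float64))
    e = np.abs(eigvals)        # eigenvalues of a real image matrix may be complex
    e = e[e > 0]               # drop zero eigenvalues before taking logs
    e = e / e.sum()            # normalise so the e_j behave like probabilities
    return float(np.sum(e * np.log(1.0 / e)))
```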

If the condition \( d_{t} > T_{k} \) is satisfied at any pair of frames ft and ft+1 within the shot, a significant change in the content of the shot is indicated, and the frame after the change is picked as a key frame. If \( d_{t} < T_{k} \) throughout, the content changes during the shot are negligible [31]. In the proposed method the first frame of every shot is also selected as a key frame; for shots with little or no change, any frame can effectively express the visual content, so the first frame is taken as the key frame, as in [32]. The results of the proposed scheme demonstrate that the extracted key frames can effectively generate a meaningful summary.
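Combining the threshold rule with the first-frame rule, the shot-level selection can be sketched as follows. The function consumes precomputed entropy values; `select_key_frames` is our name, and the signed difference follows the paper's definition of \( d_{t} \).

```python
def select_key_frames(entropies: list[float]) -> list[int]:
    """Frame indices (within one shot) selected as key frames (Steps 7-8).

    `entropies` holds E_t from Eq. (4) for every frame of the shot.
    """
    t_k = 2.0 * sum(entropies) / len(entropies)    # threshold T_k from Sect. 3
    keys = [0]                                     # the first frame is always kept
    for t in range(len(entropies) - 1):
        if entropies[t] - entropies[t + 1] > t_k:  # d_t = E_t - E_{t+1}
            keys.append(t + 1)                     # keep the frame after the change
    return keys
```

If \( d_{t} \) never exceeds \( T_{k} \), the list contains only index 0, which matches the rule that a near-static shot is represented by its first frame alone.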

A video summary should not contain too many key frames, since the aim of the summarization process is to allow users to quickly grasp the content of a video sequence. For this reason, the algorithm is also evaluated by the compactness (compression ratio) of the summary generated from the extracted key frames. The compression ratio relates the number of key frames in the summary to the length of the video sequence: for a given video sequence St, it is defined as in Eq. (5).
$$ CRatio\left( S_{t} \right) = 1 - \frac{\gamma_{NKF}}{\gamma_{NF}} $$
(5)
where \( \gamma_{NKF} \) is the number of key frames and \( \gamma_{NF} \) is the number of frames in the video sequence.

Ideally, a good summary produced by a key frame extraction algorithm presents a high compression ratio (i.e. a small number of key frames) [33]. In the proposed work, the compactness (compression ratio) is calculated for the key frames extracted from six videos, and the results are presented in the ensuing section.
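Eq. (5) translates directly into code; as a worked example, the 134 key frames reported in Sect. 6 for a 1,342-frame video give a ratio of about 0.90.

```python
def compression_ratio(num_key_frames: int, num_frames: int) -> float:
    """Compactness of a summary, Eq. (5): CRatio = 1 - NKF / NF."""
    return 1.0 - num_key_frames / num_frames

print(compression_ratio(134, 1342))   # ~0.90 for the example video of Fig. 5
```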

6 Experimentation and Results

The algorithm was implemented in Matlab, and the experiments were performed on a Core2 Duo 2.40 GHz Windows machine with 2 GB of RAM. In all the experiments reported in this section, the video streams are in AVI format, with a digitization rate of 25 frames/s. The experiments conducted to validate the effectiveness of the proposed shot boundary detection and key frame extraction algorithm are presented in the following. The performance of the proposed algorithm is evaluated on YouTube videos, soccer clips, and the Open Video data set, using the evaluation metrics given below.

Initially, all the video clips were split into individual frames in order to identify possible key frames. These can be judged by a human who watches the entire video; although key frames are a semantic concept, relative agreement can be reached among different people. The presented work first performs shot boundary detection by identifying the points on the time axis of the video sequence where the mutual information between two frames falls to a minimum value.

The effectiveness of the proposed algorithm is presented through its performance on six different video samples. The performance of the proposed shot boundary detection model is evaluated using precision and recall as evaluation metrics. Precision is defined as the ratio of the number of correctly detected cuts to the sum of correctly and falsely detected cuts, and recall as the ratio of the number of correctly detected cuts to the sum of correctly detected and undetected cuts, as indicated in Eq. (6):
$$ \text{Recall} = \frac{\text{Number of correctly detected boundaries}}{\text{Number of true boundaries}}, \qquad \text{Precision} = \frac{\text{Number of correctly detected boundaries}}{\text{Number of detected boundaries}} $$
(6)
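Given per-video counts of correctly detected, missed, and falsely detected boundaries, these metrics are straightforward to compute; the helper below is an illustrative sketch with assumed argument names.

```python
def recall_precision(correct: int, missed: int, false_alarms: int) -> tuple[float, float]:
    """Recall and precision of shot boundary detection, Eq. (6)."""
    recall = correct / (correct + missed)            # true boundaries = correct + missed
    precision = correct / (correct + false_alarms)   # detected = correct + false alarms
    return recall, precision
```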
As discussed in Sect. 5, the key frame extraction algorithm is also evaluated by the compactness (compression ratio) of the summary generated from the extracted key frames, defined in Eq. (5); a summary should not contain too many redundant frames if users are to grasp the content of a video sequence quickly.
Figure 2a shows the plot of mutual information values for a window of 100 frames from a video, together with the threshold computed as per the proposed algorithm; a mutual information value falling below the threshold indicates a shot boundary between the two frames for which that value is computed. Figure 2a indicates the detection of a shot boundary between frame 28 and frame 29; a shot boundary detected in another video sequence, between frames 49 and 50, is shown in Fig. 2b.
Fig. 2

a The plot of mutual information values against frame index, with the threshold, indicating a shot boundary between frame 28 and frame 29. b A shot boundary detected between frame 49 and frame 50 using the proposed algorithm. c The plot of spectral entropy difference values against frame index, with the threshold, indicating frames 7, 13, 19 and 25 as key frames

Figure 2c plots the entropy difference values against the frame index and shows that the difference value \( d_{t} \) exceeds the threshold \( T_{k} \) at four instants: between frames 6 and 7, frames 12 and 13, frames 18 and 19, and frames 24 and 25. As per the proposed algorithm, frames 7, 13, 19 and 25 are therefore selected as key frames. The four frames extracted out of the 28 frames of the shot (Fig. 3a) are given in Fig. 3b.
Fig. 3

a The 28 frames of a representative shot separated from a video clip of 1,342 frames using the mutual information based algorithm. b The 4 key frames extracted using the proposed spectral entropy difference based algorithm

The experimentation tests the performance of the proposed mechanism on a set of video sequences; the results for six such sequences from different sources are presented in Table 1.
Table 1

Performance parameters of the proposed shot boundary detection algorithm and results of the key frame extraction experiments, in terms of the number of frames in each shot, the number of key frames extracted, and the time taken by the processor for extraction of the key frames

Video 1 (Recall 1.000, Precision 1.000, Compression ratio 0.88)
  Shot no.        1       2       3       4       5       6
  No. of frames   60      146     56      409     36      188
  Key frames      8       15      1       43      4       28
  Time            23.89   53.02   22.05   148     13.49   67.01

Video 2 (Recall 1.000, Precision 1.000, Compression ratio 0.91)
  Shot no.        1       2       3       4       5       6
  No. of frames   52      170     188     395     174     131
  Key frames      8       19      12      42      18      14
  Time            33.7    62.19   64.2    150.2   63.94   58.5

Video 3 (Recall 0.857, Precision 0.950, Compression ratio 0.92)
  Shot no.        1       2       3       4       5       6
  No. of frames   49      91      538     312     109     170
  Key frames      1       9       54      34      4       15
  Time            16.1    31.53   176.19  99.2    56.97   72.46

Video 4 (Recall 0.857, Precision 1.000, Compression ratio 0.90)
  Shot no.        1       2       3       4       5       6
  No. of frames   60      146     58      409     36      188
  Key frames      5       16      6       42      3       28
  Time            24.5    60.35   24.05   163.16  14.3    55.77

Video 5 (Recall 0.903, Precision 0.975, Compression ratio 0.92)
  Shot no.        1       2       3       4       5       6
  No. of frames   28      115     88      113     103     240
  Key frames      4       13      7       7       5       23
  Time            10.84   39.92   32.62   57.42   43.15   86.59

Video 6 (Recall 0.875, Precision 0.950, Compression ratio 0.90; two shots only)
  Shot no.        1       2
  No. of frames   640     660
  Key frames      68      70
  Time            236     248
The graphs in Fig. 4a and b reveal that the number of key frames extracted depends on the number of frames in each shot separated using the proposed method; the time taken by the mechanism is also proportional to the number of frames in the shot under consideration.
Fig. 4

a Graph of the number of frames and the number of key frames extracted, plotted against shot number. b Plot of the total number of key frames extracted from video sequences of different frame lengths, and the time taken by the algorithm for key frame extraction from the different videos

Figure 5 is a snapshot of a folder containing the 134 key frames extracted from a video of 1,342 frames, from which a summary of the video can successfully be constructed.
Fig. 5

Snapshot of the folder containing all the key frames extracted from a video of 1,342 frames

The proposed mechanism for key frame extraction does not pick frames at fixed regular intervals, as many algorithms do; instead, every content change within every shot is detected, and key frames are extracted only at the instances of content change. The extracted key frames can therefore construct a summary without redundant frames while covering all of the important information content of the video. Average recall and precision values of up to 95 % for the shot boundary detection algorithm, and an average compression ratio of 92 % for the key frame extraction scheme, indicate that the proposed mechanism is suitable for extracting key frames to construct summaries of sports videos.

7 Conclusion

In this paper an algorithm for key frame extraction is presented which selects a variable number of key frames from shots segmented using mutual information as the similarity measure differentiating two consecutive frames. The performance of the shot boundary detection algorithm is reported in terms of precision and recall. The proposed key frame selection algorithm picks a key frame whenever there is a change in the visual content of the shot, where the change is measured by the entropy difference between one frame and the next. The entropy computation uses only the eigenvalues of the frames instead of the complete image entropy, which minimizes computational complexity and time. A frame within a shot is extracted as a key frame if its entropy difference from the previous frame crosses a predefined threshold. The results show that the algorithm is able to summarize video while capturing all salient events in the sequence. The compression ratio, defined in Eq. (5) as one minus the ratio of the number of key frames used to build the summary to the total number of frames in the video, is 92 %; this value reveals that the algorithm preserves the overall information while compressing the video by over 92 %.

References

1. Yang S, Lin X (2005) Frame extraction using unsupervised clustering based on a statistical model. Tsinghua Sci Technol 10(2):169–173
2. Zhuang Y, Rui Y, Huang TS, Mehrotra S (2002) Adaptive key frame extraction using unsupervised clustering. In: Proceedings of the IEEE international conference on image processing (ICIP'02), Chicago, IL, pp 886–890
3. Lai J-L, Yi Y (2012) Key frame extraction based on visual attention model. J Vis Commun Image Represent 23:114–125
4. Hanjalic A (2002) Shot-boundary detection: unraveled and resolved? IEEE Trans Circuits Syst Video Technol 12(2):90–105
5. Girgensohn A, Boreczky J (2000) Time-constrained key frame selection technique. Multimedia Tools Appl 11:347–358
6. Ju SX, Black MJ, Minneman S, Kimber D (1998) Summarization of videotaped presentations: automatic analysis of motion and gestures. IEEE Trans Circuits Syst Video Technol 8(5):686–696
7. Turaga P, Veeraraghavan A (2009) Unsupervised view and rate invariant clustering of video sequences. Comput Vis Image Underst 3:353–371
8. Liu L, Fan G (2005) Combined key-frame extraction and object-based video segmentation. IEEE Trans Circuits Syst Video Technol 15(7):869–884
9. Setnes M, Babuska R (2001) Rule base reduction: some comments on the use of orthogonal transforms. IEEE Trans Syst Man Cybern C Appl Rev 31(2):199–206
10. Nagasaka A, Tanaka Y (1992) Automatic video indexing and full-video search for object appearances. In: Second working conference on visual database systems
11. Zhuang Y, Rui Y, Huang TS, Mehrotra S (2007) An adaptive key frame extraction using unsupervised clustering. IEEE Trans Circuits Syst Video Technol, pp 168–186
12. Hanjalic A, Langendijk RL (2006) A new key-frame allocation method for representing stored video streams. In: 1st international workshop on image databases and multimedia search
13. Girgensohn A, Boreczky J (2001) Time-constrained key-frame selection technique. Multimedia Tools Appl
14. Dhawale CA, Jain S (2008) A novel approach towards keyframe selection for video summarization. Asian J Inf Technol
15. Tonomura Y, Akutsu A, Otsuji K, Sadakata T (1993) VideoMAP and VideoSpaceIcon: tools for anatomizing video content. In: Proceedings of the ACM INTERCHI'93 conference, pp 131–141
16. Ueda H, Miyatake T, Yoshizawa S (1991) IMPACT: an interactive natural-motion-picture dedicated multimedia authoring system. In: Proceedings of the ACM CHI'91 conference, pp 343–350
17. Rui Y, Huang TS, Mehrotra S (1998) Exploring video structure beyond the shots. In: Proceedings of the IEEE international conference on multimedia computing and systems (ICMCS), Texas, USA, pp 237–240
18. Pentland A, Picard R, Davenport G, Haase K (2003) Video and image semantics: advanced tools for telecommunications. IEEE Multimedia 1(2):73–75
19. Sun Z, Ping F (2004) Combination of color and object outline based method in video segmentation. In: Proceedings of SPIE storage and retrieval methods and applications for multimedia, 5307:61–69
20. Arman F, Hsu A, Chiu MY (1993) Image processing on compressed data for large video databases. In: Proceedings of ACM Multimedia '93, Anaheim, CA, pp 267–272
21. Savakis A, Rao RM (2003) Key frame extraction using MPEG-7 motion descriptors. In: Proceedings of the 37th Asilomar conference on signals, systems and computers
22. Girgensohn A, Boreczky J (2000) Time-constrained keyframe selection technique. Multimedia Tools Appl 11:347–358
23. Gong Y, Liu X (2000) Generating optimal video summaries. In: Proceedings of the IEEE international conference on multimedia and expo, 3:1559–1562
24. Zhao L, Qi W, Li SZ, Yang SQ, Zhang HJ (2000) Key-frame extraction and shot retrieval using nearest feature line (NFL). In: Proceedings of the ACM international workshops on multimedia information retrieval, pp 217–220
25. Hanjalic A, Lagendijk RL, Biemond J (1998) A new method for key frame based video content representation. In: Image databases and multimedia search. World Scientific, Singapore
26. Hoon SH, Yoon K, Kweon I (2000) A new technique for shot detection and key frames selection in histogram space. In: Proceedings of the 12th workshop on image processing and image understanding, pp 475–479
27. Narasimha R, Savakis A, Rao RM, De Queiroz R (2004) A neural network approach to key frame extraction. In: Proceedings of SPIE-IS&T electronic imaging: storage and retrieval methods and applications for multimedia, 5307:439–447
28. Calic J, Izquierdo E (2002) Efficient key-frame extraction and video analysis. In: Proceedings of IEEE ITCC 2002, multimedia web retrieval section, pp 28–33
29. Liu TM, Zhang HJ, Qi FH (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans Circuits Syst Video Technol 13(10):1006–1013
30. Cover TM, Thomas JA (2003) Elements of information theory. Wiley. ISBN 0-471-06259-6
31. Ciocca G, Schettini R (2004) An innovative algorithm for key frame extraction in video summarization. Int J Pattern Recognit Artif Intell 18(5):819–846
32. Truong BT, Venkatesh S (2007) Video abstraction: a systematic review and classification. ACM Trans Multimedia Comput Commun Appl 3(1), Article 3
33. Ciocca G, Schettini R (2004) An innovative algorithm for key frame extraction in video summarization. J Real-Time Image Process 1(1):69–88

Copyright information

© Springer India 2013

Authors and Affiliations

Department of CSE, Basaveshwar Engineering College, Bagalkot, India
