Tensor-Based Shot Boundary Detection in Video Streams
Abstract
This paper presents a method for content change detection in multidimensional video signals. Video frames are represented as tensors of order consistent with the signal dimensions. The method operates on unprocessed signals, and no special feature extraction is assumed. The dynamic tensor analysis method is used to build a tensor model from the stream. Each new datum in the stream is then compared to the model with the proposed concept drift detector. If it fits, the model is updated; otherwise, the model is rebuilt starting from that datum, and a signal shot is recorded. The proposed fast tensor decomposition algorithm allows efficient operation compared to the standard tensor decomposition method. Experimental results show many useful properties of the method, as well as its potential further extensions and applications.
Keywords
Video shot detection · Anomaly detection · Tensor decomposition · Tensor frames · Dynamic tensor analysis
Introduction
Enormously increasing amounts of visual information raise the need for the development of automatic data analysis methods. Among these, special attention is paid to video summarization methods. Their goal is to give a user a short and useful visual abstract of an entire video sequence. This can be further used to improve the efficacy of video cataloging, indexing, archiving, information search, and data compression, to name a few [47]. These methods, in turn, rely on efficient algorithms of boundary detection in visual streams. Basically, summarization methods are divided into static and dynamic ones [33, 35]. In the former, a set of consecutive frames with sufficiently coherent content is represented by a single representative frame, called a key frame. On the other hand, dynamic summarization, also called a video skim, relies on selecting the most relevant small fractions of the video sequence [19]. This allows for better representation of concise portions of the video, thanks to the inclusion of visual and audio tracks. A key step in these methods is to track a video stream and recognize regions of sufficiently abrupt change in the visual content. However, such a generally stated task is subjective and depends on the specific video content, which manifests in the different segmentations produced in human-made experiments. For the purpose of automatic video summarization, significant research has been conducted toward methods for the computation of sufficiently discriminative features [20, 45]. In this spirit, the majority of methods first employ extraction of specific features, which are then used for video clustering and classification.

The main contributions of this paper are as follows:

the original tensor framework for video shot detection, based on the tensor model and its continuous build-and-update mechanism;

a novel concept of drift detectors in tensor video streams, based on statistical analysis of the differences of the tensor-frame projections onto the model;

fast tensor subspace computation with the modified fixed-point eigenvalue method, endowed with a mechanism of dominant rank determination and re-initialization from previous eigenvectors.
As mentioned, the proposed method can operate with signals of any dimensionality, such as monochrome and color videos, as presented in the experimental section of this paper. No feature extraction is necessary, and the signal is treated “as is”. Therefore, the method can work with frames of any dimension and content, such as hyperspectral or compressed signals, etc. However, due to the so-called “curse of dimensionality”, when going into higher dimensions we face the problem of high memory and computational requirements [4, 55]. In addition, if the statistical properties of the processed signal are known, then feature-based methods can perform better by taking advantage of expert knowledge. Nonetheless, the proposed framework can also accept features as well as the signal as its input, as will be discussed.
The rest of this paper is organized as follows. Section 2 outlines the basic concepts related to video signal representation and analysis, as well as presents related works in the area of the video segmentation, tensor processing, and data streams. A short overview on tensors and their decomposition is presented in Sect. 3. The main concepts of the proposed method are presented in Sect. 4. Specifically, system architecture and basic concepts of tensor stream analysis are presented in Sect. 4.1, and a modified best rank(R _{1}, R _{2}, …, R _{ P }) tensor decomposition for stream tensors is discussed in Sect. 4.2. Efficient computation of the dominating tensor subspace is outlined in Sect. 4.3. Computation of the proposed drift detection in video tensor streams is presented in Sect. 4.4. Finally, the video tensor stream model buildandupdate scheme is presented in Sect. 4.5. Experimental results are presented in Sect. 5, while conclusions are presented in Sect. 6.
Overview of Video Signal Representation and Analysis
This section contains a brief introduction to the video representation and signal analysis methods for the purpose of scene shot detection and video summarization.
Video Structure and Analysis

Hard cuts: an abrupt change of content;

Soft cuts: a gradual change of content. Two common types of soft cuts are:

Fade-in/out: a new scene gradually appears, or the current one gradually disappears, from the image;

Dissolve: the current shot fades out, whereas the incoming one fades in.
Detection of different types of shots is a special case of concept drift detection in streams of data [17, 28, 41]. In the case of the video signals analyzed in this paper, it requires the development of special methods which account for the different statistical properties of video signals. This problem, as well as the proposed solutions, is discussed in further sections of this paper.
Related Works
As already mentioned, due to high social expectations, video analysis, and especially video summarization, has gained considerable attention in the research community. Recently, many automatic systems for video segmentation have been proposed. A prevailing majority of them rely on specific feature extraction followed by clustering and classification [10, 40, 54]. A survey of video indexing, including methods of temporal video segmentation and keyframe detection, is contained in the paper by Asghar et al. [1]. A description of the main tasks in video abstraction is provided in the paper by Truong and Venkatesh [45]. On the other hand, Fu et al. present an overview of multi-view video summarization methods [15]. Valdes and Martinez discuss efficient video summarization and retrieval tools [47]. A recent survey of video scene detection methods is provided in the paper by Fabro and Böszörmenyi [14]. They classify video segmentation methods into seven groups, depending on the low-level features used for the segmentation. As a result, visual-based, audio-based, text-based, audio-visual, visual-textual, audio-textual, and hybrid segmentations can be listed. An overview of papers presenting more specific methods follows.
Lee et al. propose a unified scheme of shot boundary and anchor shot detection in news videos [33]. Their method relies on singular value decomposition and the so-called Kernel-ART method. The anchor shot detection is based on a skin color detector, a face detector, and support vector data descriptions with non-negative matrix factorization. However, their method is limited to videos containing specific scenes, persons, etc. On the other hand, DeMenthon et al. presented a simple system of video summarization by curve simplification [37]. In their approach, a video sequence is represented as a trajectory curve in a high-dimensional feature space. It is then analyzed with the proposed binary curve-splitting algorithm. This way, the partitioned video is represented with a tree-like structure. However, the method relies on feature extraction from the video; the features are the DC coefficients used in the MPEG coding standard. Another method of video summarization was proposed by Mundur et al. [38], in which keyframe-based video summarization is computed with Delaunay clustering [29]. In the STIMO system proposed by Furini, a method for moving video storyboards in the Web scenario is proposed [16]. The method is optimized for Web operation to produce on-the-fly video storyboards. It is based on a fast clustering algorithm that selects the most representative video content using the color distribution in the HSV color space, computed on a frame-by-frame basis.
A method called VSUMM is proposed in the paper by de Avila et al. [2]. Their approach is based on the computation of color histograms from frames acquired from a video stream at a rate of one per second, which are then clustered with the k-means method. Further on, for each cluster, the frame closest to the cluster center is chosen as the so-called keyframe that represents a given slice of the video. De Avila et al. also developed a method of evaluating static video summaries, which is used in further works for comparison. A significant contribution of their work is that all of the user annotations used in their experiments were also made available on the Web, which greatly facilitates further comparative evaluations [22]. In this paper, we also refer to their data and provide comparative results, as will be discussed.
A color histogram for video summarization is also used in the method proposed by Cayllahua-Cahuina et al. [7]. In their method, a 3D histogram of 16 × 16 × 16 bins is first calculated directly in the RGB color space. The obtained vectors of 4096 elements are further processed with the PCA method for compression. In the next step, two clustering algorithms are employed: the first, Fuzzy-ART, determines the number of clusters, while the second, Fuzzy C-Means, performs frame clustering based on the color histogram features. However, the reported results indicate that using only color information is not enough to obtain satisfactory results [7].
In the paper by Medentzidou and Kotropoulos, a video summarization method based on shot boundary detection with penalized contrast is proposed [36]. This method also relies on color analysis, using the HSV color space, however. The mean of the hue component is proposed as a means of change detection. Then, video segments are extracted to capture the video content. These are described with a linear model with additive noise. As reported, the method obtains results comparable to other methods, such as [2].
Another video summarization method, called VSCAN, was proposed by Mahmoud et al. [35]. Their method is based on a modified density-based spatial clustering of applications with noise (DBSCAN) method, which is used for video summarization based on the color and texture features of each frame.
In our framework, we utilize tensor analysis. In the next section, we provide a brief introduction to this domain, introducing the notation and concepts necessary for understanding the presented method. First of all, the papers by de Lathauwer et al. can be recommended [30, 31, 32]. Tensors and their decompositions are also described in the papers by Kolda [27] and Kolda and Bader [26], as well as in the books, for example, by Cichocki [6] or by Cyganek [10].
In this paper, we also rely on the concepts of data stream analysis, concept drift detection, and data classification in data streams. Recently, these have gained much attention among researchers. This new domain of data processing is analyzed, for instance, in the book by Gama [17] or in the papers by Krawczyk et al. [28] and Woźniak et al. [53], to name a few. The combination of the two domains, that is, streams of tensor data, was pioneered by the works of Sun et al. [43, 44], as will be further discussed.
Introduction to Tensors and Their Decompositions
In this section, we present the basics of visual signal representation with tensors, as well as signal analysis based on tensor decompositions. In particular, tensor decompositions have gained much attention, since they lead to the discovery of latent factors and are used for multidimensional data classification [9, 42] as well as for data compression [52]. Among many, the most prominent are the higher-order singular value decomposition (HOSVD), the canonical decomposition (Candecomp/Parafac) [25, 26, 51], as well as the Tucker [46] and the best rank-(R _{1}, R _{2},…, R _{ P }) decompositions [10, 11, 30, 32]. The method presented in this paper relies on the latter, so its details, as well as its computational aspects, are discussed in further sections of this paper.
Signal Processing with Tensors
Multidimensional measurements and signals are frequently encountered in science and industry. Examples are multimedia, video surveillance, hyperspectral images, and sensor networks, to name a few. However, their processing with standard vector-based methods is usually not sufficient due to the loss of important information contained in the internal structure and relations among the data. Thus, for analysis and processing, methods that directly account for data dimensionality are more appropriate. In this respect, it has been shown that tensor-based methods frequently offer many benefits, as in the case of TensorFaces and view synthesis proposed by Vasilescu and Terzopoulos [48, 49, 50], data dimensionality reduction by Wang and Ahuja [51, 52], handwritten digit recognition proposed by Savas and Eldén [42], or road sign recognition by Cyganek [9], to name a few. In this section, we present a brief introduction to the tensor analysis necessary for understanding further parts of this paper. However, this short introduction by no means exhausts the subject; further information can be found in many publications, such as, for instance, [5, 10, 11, 24, 25, 32, 48].
A j-th flattening of \({\mathcal{A}}\) is obtained by selecting the j-th dimension, which becomes the row index of A _{(j)}, whereas the product of all remaining indices constitutes the column index of A _{(j)}. Columns of A _{(j)} are called j-th fibers of \({\mathcal{A}}\), and they span the j-th tensor subspace.
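The flattening described above can be sketched in a few lines of NumPy. This is an illustrative helper with our own name (`flatten`); note that the exact column ordering of a flattening differs between conventions, but the column (fiber) space is the same.

```python
import numpy as np

def flatten(A, j):
    """Return the mode-j flattening A_(j) of tensor A (0-based index j)."""
    # The j-th dimension becomes the row index; all remaining indices
    # are merged into the column index.
    return np.moveaxis(A, j, 0).reshape(A.shape[j], -1)

A = np.arange(24).reshape(2, 3, 4)   # a 3rd-order tensor
A1 = flatten(A, 1)                   # mode-1 flattening: 3 rows, 2*4 columns
```

Each column of `A1` is a mode-1 fiber of `A`, e.g., its first column is `A[0, :, 0]`.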
That is, Eq. (3) is an extension of the “classical” matrix multiplication, in which the product involves multiplication of the elements of the j-th fibers of the tensor \({\mathcal{T}}\) with consecutive rows of the matrix M. As a result, the j-th dimension of \({\mathcal{T}}\) is changed to Q [10, 30].
On the other hand, and contrary to well-known matrix algebra, there is no single definition of the rank of a tensor. There are at least three different concepts of a tensor rank, of which the so-called r-th rank will be used throughout this paper: the r-th rank of a tensor \({\mathcal{A}}\) is the dimension of the vector space spanned by the columns of the r-th flattening A _{(r)} of this tensor.
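The r-th rank definition above reduces to an ordinary matrix rank of the flattening, which can be checked numerically; `mode_rank` is our own illustrative name, not the paper's.

```python
import numpy as np

def mode_rank(A, r):
    """r-th rank: dimension of the column space of the r-th flattening."""
    A_r = np.moveaxis(A, r, 0).reshape(A.shape[r], -1)  # r-th flattening
    return np.linalg.matrix_rank(A_r)

# A rank-1 tensor (an outer product of three vectors) has all mode ranks 1.
T = np.einsum('i,j,k->ijk',
              np.array([1., 2.]),
              np.array([1., -1., 3.]),
              np.array([2., 5.]))
ranks = [mode_rank(T, r) for r in range(3)]   # → [1, 1, 1]
```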
The Best Rank-(R _{1}, R _{2}, …, R _{ P }) Tensor Decompositions
There are many types of tensor decompositions, which aim at representing a tensor as a product of other tensors. These components frequently need to fulfill specific conditions, such as orthogonality or sparsity. The goals can be manifold: to reveal latent components of an input tensor, to span orthogonal tensor subspaces, or to reduce data, to name a few [5, 6, 10, 24, 30]. The first decomposition method was proposed by Tucker in the mid-1960s for the analysis of psychometric data [46]. The last two decades, though, have brought the development of many new tensor decomposition methods [5, 6, 10, 26, 27]. Among them, the following three are usually highlighted: the higher-order singular value decomposition (HOSVD) [8, 9], the Candecomp/Parafac (CP) [25], and the best rank-(R _{1}, R _{2},…, R _{ P }) decomposition [32]. These extend the concept introduced by Tucker, as will be discussed.
In other words, the Tucker decomposition given in (7) states that a tensor \({\mathcal{T}}\) is approximated by its projection onto the space spanned by the matrices S _{ k }. Thus, the whole task is to compute the series of S _{ k } mode matrices. However, in many applications, it is beneficial to require orthogonality or specific ranks of S _{ k } in (4). If such a constraint is assumed, then the Tucker decomposition (4) leads to the best rank-(R _{1}, R _{2},…, R _{ P }) decomposition, defined as follows [32].
The best rank(R _{1}, R _{2},…, R _{ P }) approximation of a tensor \({\mathcal{T}} \in \Re^{{N_{1} \times N_{2} \times \ldots \times N_{P} }}\) is a tensor \({\tilde{\mathcal{T}}}\) of ranks in each of its modes \({\textit{rank}}_{1} {\tilde{\mathcal{T}}} = R_{1}\), \({\textit{rank}}_{2} \;{\tilde{\mathcal{T}}} = R_{2}\),…, \({\textit{rank}}_{P} {\tilde{\mathcal{T}}} = R_{P}\), respectively, which minimizes the function (7).
In Algorithm 1, the function fds(S, R, T) denotes the fast dominating subspace computation. It computes the R leading left eigenvectors of the matrix S, with the initial values taken from the matrix T. For this purpose, the classical Jacobi-based SVD algorithm can be used. However, in this work, we propose to use a much faster fixed-point algorithm with further modifications, as discussed in Sect. 4.3. This is a key modification to the best rank-(R _{1}, R _{2},…, R _{ P }) algorithm, which allows for more efficient processing of video streams.
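Algorithm 1 itself is not reproduced here, but the alternating scheme it follows can be sketched in HOOI style: iterate over the modes, project the tensor onto the subspaces of all other modes, and recompute the leading subspace of that mode's flattening. In this sketch, the paper's fds() routine is replaced by a plain symmetric eigendecomposition, and all helper names are our own assumptions.

```python
import numpy as np

def mode_mul(T, M, k):
    """k-mode product T x_k M, where M has shape (Q, N_k)."""
    Tk = np.moveaxis(T, k, 0)
    out = (M @ Tk.reshape(Tk.shape[0], -1)).reshape((M.shape[0],) + Tk.shape[1:])
    return np.moveaxis(out, 0, k)

def fds(S, R):
    """Leading R eigenvectors of symmetric S (stand-in for the paper's fds)."""
    w, V = np.linalg.eigh(S)
    return V[:, np.argsort(w)[::-1][:R]]

def best_rank(T, ranks, iters=10):
    P = T.ndim
    S = []
    for k in range(P):                     # initialize from each flattening
        Tk = np.moveaxis(T, k, 0).reshape(T.shape[k], -1)
        S.append(fds(Tk @ Tk.T, ranks[k]))
    for _ in range(iters):
        for k in range(P):
            G = T
            for m in range(P):             # project onto all modes except k
                if m != k:
                    G = mode_mul(G, S[m].T, m)
            Gk = np.moveaxis(G, k, 0).reshape(T.shape[k], -1)
            S[k] = fds(Gk @ Gk.T, ranks[k])
    core = T
    for m in range(P):
        core = mode_mul(core, S[m].T, m)   # core tensor of the decomposition
    return core, S

T = np.random.default_rng(0).standard_normal((6, 5, 4))
core, S = best_rank(T, (3, 3, 3))
```

With full ranks the reconstruction is exact; with truncated ranks the mode matrices span the dominating subspaces that minimize the approximation error.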
However, in the presented system, dynamic tensor analysis will be used, which aims at processing a series of tensors to build a model representing them. For this purpose, Algorithm 1 needs to be extended, as will be discussed.
Tensor Model and Inferring Method for Video Data Stream Analysis
The presented video shot detection method employs the stream and dynamic tensor analysis methods proposed by Sun et al. [43, 44]. However, we introduce a number of modifications to allow efficient operation on video streams of different dimensions, as well as a new method of constructing a tensor model specific to video signals. In this section, we describe the basic ideas behind the tensor stream processing framework, as well as our proposed features, which allow efficient video shot detection in various types of video streams.
System Architecture and Basic Concepts of the Tensor Stream Analysis
In the proposed system, a stream of input tensor data of the same order P is assumed. As already mentioned, in this version, we do not assume any feature extraction from the input tensors. This means that any type of signal can be fed to the presented algorithms; thus, the method works with both monochrome and color video streams, which makes it versatile. Nevertheless, if there is additional knowledge of the signal type, then methods that compute specific features can be more discriminative, but only for that specific type of signal. This happens with color histograms, frequently used in video shot detection [2, 7, 35]; such methods will not work with monochrome or hyperspectral images, though. Thus, in our framework, features can also be included in the input tensors to make the signals more discriminative, but this possibility is left for further research. To proceed, we introduce a number of definitions after the work of Sun et al. [44]: a sequence of tensors is a series of m tensors \({\mathcal{A}}\) _{ i }, where 1 ≤ i ≤ m, each of P-th order, \({\mathcal{A}} \in \Re^{{N_{1} \times N_{2} \times \ldots \times N_{P} }}\), where m is constant. A stream of tensors is a sequence of m tensors, where m is a natural number increasing with time.
Our proposed system operates as follows. From the input stream of tensor data, a window of consecutive frame tensors of size W is selected. All of them are used to build a best rank-(R _{1}, R _{2},…, R _{ P }) tensor model, as described in Algorithm 1. However, the main modification of this model, which we incorporate after the work by Sun et al., consists of computing a covariance matrix out of all flattened versions of the tensors from the selected window W. This way, for each of the P flattenings, a single covariance matrix is created from all tensors in the input window. Thus, such a covariance matrix conveys statistical information on all of the input tensors in a given flattening mode. A further computational advantage of this approach is that the covariance matrix is positive definite, for which a more effective eigenvalue decomposition method can be applied, as will be discussed. It is worth noticing that such an approach differs from building and decomposing a new large tensor, composed of all tensors from the window stream, after adding a new dimension N _{ P+1} of value W. Details of the modified best rank-(R _{1}, R _{2},…, R _{ P }) decomposition, as well as of the model build-and-update procedures for streams of tensor data, are presented in the next section.
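The per-mode covariance construction described above can be sketched as follows: for each of the P flattening modes, a single matrix accumulates the outer products of the flattenings over all tensor frames in the window. Function and variable names are our own assumptions, not the paper's.

```python
import numpy as np

def window_covariances(frames):
    """One covariance-like matrix per flattening mode, over a window of frames."""
    P = frames[0].ndim
    C = [np.zeros((n, n)) for n in frames[0].shape]
    for A in frames:
        for k in range(P):
            Ak = np.moveaxis(A, k, 0).reshape(A.shape[k], -1)  # k-th flattening
            C[k] += Ak @ Ak.T          # symmetric, positive semi-definite
    return C

# A window of five 8x8 RGB "frames" yields three matrices: 8x8, 8x8, 3x3.
window = [np.random.default_rng(i).standard_normal((8, 8, 3)) for i in range(5)]
C = window_covariances(window)
```

Compared with stacking the window into one (P+1)-order tensor, the matrices stay small regardless of the window size W.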
A Modified Best Rank(R _{1}, R _{2},…, R _{ P }) Tensor Decomposition for Stream Tensors
Algorithm 2 presents the best rank(R _{1}, R _{2},…, R _{ P }) tensor decomposition algorithm modified for processing of a stream of tensor data.
When compared to Algorithm 1, the new steps are (14) and (15), which serve the computation of the covariance matrices from each tensor in the input window W and at each of the P flattening modes. In step (16), the fast dominating subspace is computed with the fds(C _{ k } ^{(t+1)}, R _{ k }, S _{ k } ^{(t)}) function, whose details are described in the next section. Nevertheless, it is worth noticing that each iteration starts with the initial value of S _{ k } ^{(t)} computed in the previous step, thus speeding up the convergence, as was verified experimentally; only the first iteration starts with a random initialization. It is important to notice that the two loops in Algorithm 2 determine the speed of a model build and, in consequence, the computation time of the whole method.
Efficient Computation of the Dominating Tensor Subspace
Application of tensor decomposition to the analysis of tensor streams, such as video signals, would be very limited or even useless if it could not be computed efficiently. In this respect, the best would be a real-time response, or at least an answer in a user-acceptable time, which certainly is a subjective notion. To be more concrete, we measure processing time in terms of input (tensor) frame processing. However, efficiency also means computational stability, as will be discussed.
As already mentioned, thanks to the formulation of the computation steps in Algorithm 2 with only positive, real, symmetric matrices, it is possible to employ a faster algorithm than the general-case SVD of matrices. For this purpose, in our system, the so-called fixed-point eigendecomposition algorithm is used for the computation of the fds function in step (16) of Algorithm 2. This method was first proposed for tensor decomposition by Muti and Bourennane [39]. It was later also used for computation of the HOSVD tensor decomposition in the work by Cyganek and Woźniak [12]. Algorithm 3 shows the key steps of this method.
However, after prolonged use of this algorithm for different tensor decompositions, we noticed problems with its numerical stability. Under closer examination, it appeared that the problem is usually caused by the multiplication step (21) in Algorithm 3, especially if the elements of the input matrix C have magnitudes a few orders of magnitude higher than the values of the vector e _{ k }. Therefore, we propose to add an additional normalization step (22) just after the multiplication (21) and before the Gram–Schmidt orthogonalization step (23). Our experiments showed that the proposed modification improves the numerical stability of the whole procedure at an insignificant additional computational cost.
The next modification proposed in this paper is an automatic mechanism for evaluating the number of important eigenvalues k _{imp}, which is determined dynamically during the run of the algorithm. The main idea is to check the ratio of the logarithms of the eigenvalues corresponding to consecutive eigenvectors. If this ratio exceeds a preset threshold, set to 1.5 in our experiments, then the procedure is stopped. In practice, this simple rule allows for significant computational savings.
The next simple modification is the initialization of the starting values of the eigenvectors with the values computed in the previous iteration, rather than with random values, as presented in (16). This relies on the simple observation that the computed eigenvectors do not lie far from each other between consecutive iterations. Therefore, such initialization usually leads to faster convergence than starting from randomly chosen initial values. Random initialization is applied only on the first run.
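The three modifications above can be sketched together in one fixed-point routine: (1) a normalization step right after the multiplication, (2) truncation of the dominant rank when the ratio of the logarithms of consecutive eigenvalues exceeds a threshold (1.5 per the text), and (3) a warm start from the eigenvectors of a previous call. The helper names, the exact form of the stopping rule, and the tolerances are our own assumptions, since Algorithm 3 is not reproduced here.

```python
import numpy as np

def fds(C, R, E_prev=None, ratio_thresh=1.5, iters=100, tol=1e-10):
    """Fixed-point dominating subspace of symmetric positive definite C."""
    n = C.shape[0]
    E = np.zeros((n, R))
    lams = []
    rng = np.random.default_rng(0)
    for k in range(R):
        # (3) Warm start from the previous model's eigenvector, if available.
        e = (E_prev[:, k] if E_prev is not None and k < E_prev.shape[1]
             else rng.standard_normal(n))
        e /= np.linalg.norm(e)
        for _ in range(iters):
            e_new = C @ e                      # multiplication step (21)
            e_new /= np.linalg.norm(e_new)     # (1) added normalization (22)
            e_new -= E[:, :k] @ (E[:, :k].T @ e_new)  # Gram-Schmidt step (23)
            e_new /= np.linalg.norm(e_new)
            if np.linalg.norm(e_new - e) < tol:
                e = e_new
                break
            e = e_new
        lam = float(e @ C @ e)                 # Rayleigh-quotient eigenvalue
        # (2) Dominant-rank check on the ratio of eigenvalue logarithms.
        if lams and np.log(lams[-1]) / max(np.log(lam), 1e-12) > ratio_thresh:
            break
        lams.append(lam)
        E[:, k] = e
    return E[:, :len(lams)], np.array(lams)
```

On a spectrum such as (1000, 900, 1.1), the routine keeps only the first two eigenvectors, since log(900)/log(1.1) far exceeds the threshold.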
Concept Drift Detection in Video Tensor Streams
The goal of concept drift detection is to discover conditions in which a data model no longer represents the content of the current data stream well. This usually happens due to changes in the statistical properties of the data stream caused by external factors. Concept drift detection in various streams is a dynamic research area [17, 18, 28]. In the case of video streams, a concept drift usually corresponds to a content change in the video signal due to a hard or soft cut, as already discussed.
In the proposed system, the data model is obtained with the best rank-(R _{1}, R _{2},…, R _{ P }) tensor decomposition, as already described. The model is stored as a series of the matrices S _{ k } and the covariance matrices C _{ k }. Then, a model fit measure for a new tensor \(\mathcal{X}\) is indirectly obtained from the error value (7). However, we need to develop a measure to check whether a new tensor \(\mathcal{X}\) fits the model.
Video Tensor Stream Model BuildandUpdate Scheme
All of the aforementioned mechanisms are assembled here into a complete tensorbased method for efficient video stream analysis. Algorithm 4 shows our proposed tensor model build and concept drift detection mechanisms for video stream analysis.
It is worth noticing that the initial rank values R _{1}, R _{2},…, R _{ P } are the maximal possible ranks considered when building the model. The actual ranks are determined by the already described automatic rank assessment mechanism and, in practice, are usually much smaller than the conservatively assumed initial ranks. In our experiments, these are set as a percentage of the dimensions of the tensor frames, for instance, R _{1} = 0.25N _{1}, R _{2} = 0.25N _{2}, and R _{3} = N _{3}, where N _{1} and N _{2} stand for the column and row dimensions, and N _{3} for the color dimension.
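The initial (maximal) rank choice above amounts to taking a fraction of each frame dimension, keeping the color mode at full rank. The helper name and the clamping to at least 1 are our own assumptions; the 25% factor follows the text.

```python
def initial_ranks(shape, frac=0.25):
    """Initial best-rank values as a fraction of the frame dimensions."""
    n1, n2, n3 = shape
    # Spatial modes are reduced; the (small) color mode keeps full rank.
    return (max(1, int(frac * n1)), max(1, int(frac * n2)), n3)

ranks = initial_ranks((240, 352, 3))   # → (60, 88, 3)
```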
The next stabilizing mechanism introduced in our system is a counter of consecutive frames that do not fit the current tensor model. Only if the model is not satisfied for G consecutive frames is the full model rebuild procedure launched. This avoids costly model rebuilds in the case of spurious frames, e.g., those caused by mechanical deterioration of celluloid film material, scratches, noise, etc. In our system, G was in the range 1–7.
Last but not least is the problem of how to update the parameters (26)–(28) when updating the model with a new tensor frame. In this case, the mean and the standard deviation are recomputed after discarding the value of the first tensor from the previous model window W. Thus, for efficient computation, all Θ(\({\mathcal{T}}\) _{ i }) need to be stored for each i-th tensor of the model. Upon model update with a tensor \({\mathcal{T}}\) _{ i+1}, its new value Θ(\({\mathcal{T}}\) _{ i+1}) is computed and added to the stored series. Finally, the mean and standard deviation are recomputed in accordance with (26)–(28), all for the differences of the error functions (29), as already discussed.
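The bookkeeping above can be sketched as a small detector class. The actual fit measure (30) and the statistics (26)-(28) are not reproduced here; in this sketch we assume a scalar per-frame error theta, flag a frame as non-fitting when it deviates from the window mean by more than a·std + b (the exact roles of a and b are our assumption), and trigger a rebuild only after G consecutive misses.

```python
import numpy as np
from collections import deque

class DriftDetector:
    def __init__(self, thetas, a=3.0, b=0.1, G=3):
        self.window = deque(thetas, maxlen=len(thetas))  # stored theta values
        self.a, self.b, self.G = a, b, G
        self.misses = 0                  # consecutive non-fitting frames

    def step(self, theta):
        """Return True when a shot boundary (model rebuild) is signalled."""
        w = np.asarray(self.window)
        if abs(theta - w.mean()) <= self.a * w.std() + self.b:
            self.misses = 0
            self.window.append(theta)    # model update: oldest theta drops out
            return False
        self.misses += 1
        if self.misses >= self.G:        # rebuild after G consecutive misses
            self.misses = 0
            return True
        return False

det = DriftDetector([1.0, 1.1, 0.9, 1.05, 0.95], a=3.0, b=0.1, G=2)
```

A fitting frame updates the stored series (the oldest value is discarded, matching the sliding-window recomputation described above), while non-fitting frames leave the model untouched until the counter reaches G.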
Experimental Results
The method was implemented in C++ in the Microsoft Visual Studio 2015 IDE. The DeRecLib library was used for tensor processing and basic decompositions [10, 13]. The experiments were run on a computer equipped with an Intel^{®} Xeon^{®} E1545 processor operating at 2.9 GHz, with 64 GB RAM, and the 64-bit version of Windows 10.
For evaluation, we used the VSUMM database and the summarizations of different videos provided therein [2, 22]. Such an approach has been undertaken by many researchers and allows qualitative evaluation and comparison among a group of video summarization methods. This database contains 50 videos from the Open Video Project [21]. The videos are stored in the MPEG-1 format at 30 fps and a resolution of 352 × 240 pixels, in color and with sound, with durations in the range of 1–4 min. They are classified into different genres, such as documentary, educational, ephemeral, historical, and lecture.
Table 1: Key control parameters of the presented method

Parameter  Description  Values used in the experiments

W  Size of the tensor-frame window used to build a model (step 4 in Algorithm 2)  3–55
\(\alpha\)  A model forgetting factor in (20)  0.6–1.0
a, b  Tensor-frame fit measure (30)  a in 1.5–3.7, b in 0.2–2.5
G  Number of consecutive frames to launch a rebuild of the tensor model (31)  1–11
R _{1}, R _{2}, …, R _{ P }  Assumed ranks of the mode matrices (Algorithm 2)  5–75% of each dimension of the tensor frame
Of the parameters collected in Table 1, the most sensitive are the values a and b of the tensor-frame fit measure in (30), as well as the model window size W and the model forgetting factor α in (20). On the other hand, the assumed ranks constitute only the initial values for the best rank-(R _{1}, R _{2},…, R _{ P }) tensor decomposition, since the real number of important factors is computed in each decomposition based on the eigenvalue ratio method, as described in Sect. 4.3.
What we have noticed is that the results depend on the chosen model window size W, but its value is not critical. That is, for many videos, good results are obtained for W in a certain range, such as 7–15 in this case, rather than for a particular value. This happens because the model is continuously updated in accordance with the procedure described in Algorithm 4. In addition, the choice of the parameter G, controlling the series of consecutive “not-fit” frames, is not critical. We set this value to 3 to avoid shot detection on spurious but single frames. Such situations are usually due to deterioration of a particular frame, e.g., from scratches or dust if it was stored on celluloid film tape, etc.
However, the results depend strongly on the parameters a and b of the fit function (30). In particular, this choice determines how many soft shots will be detected. For example, in the experiment whose results are shown in Fig. 6, these were set to (a, b) = (3.0, 2.0), which rather excludes some of the fade-in and fade-out types of shots.
Table 2: Mean F values for different methods. The last column shows average results obtained with the presented method for the color and monochrome videos, respectively
We see that the proposed method performs comparably to or better than the other methods, except for VSCAN. This can be considered a good result, taking into consideration that the method does not compute any specific features. The experiments showed that the difference between the color and monochrome video is about 0.03. After closer analysis, we noticed that using color leads to the removal of some of the false positives. Nevertheless, the difference in accuracy between the color and monochrome versions is not large, which can be explained by the small size of the color dimension, that is, three, as well as by the correlation between the color channels. On the other hand, VSUMM and VSCAN rely on color histograms; this means that the proposed method can work with monochrome videos, whereas the former cannot. In addition, the proposed method can almost automatically accommodate additional dimensions, such as in the case of multi-view or hyperspectral streams. When analyzing the results of the proposed method on various test video streams, we observed that it frequently tends to detect more shots than in the user-annotated surveys used for method evaluation (false positives). This depends strongly on the chosen parameters, especially the ones controlling the frame fitness measure, as was discussed. Nevertheless, the detected shots can also be considered valid, since usually there is a noticeable change of frame content. However, such operation results in a high recall but rather poor precision, which explains the results presented in Table 2.
Average execution time for the monochrome 2D-tensor and color 3D-tensor streams
                             Monochrome video (2D)  Color video (3D)
                             352 × 240              352 × 240
Frames                       1948                   1948
Processing speed (frames/s)  15.6                   3.2
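A quick sketch of the arithmetic behind the timing table above: the measured throughput (frames/s) implies the total wall time needed to process the 1948-frame test stream in each configuration.

```python
frames = 1948
throughput = {"monochrome 2D": 15.6, "color 3D": 3.2}  # frames/s, from the table

for name, fps in throughput.items():
    # Total processing time implied by the measured throughput.
    print(f"{name}: {frames / fps:.1f} s total")
```

The roughly fivefold slowdown for color streams reflects the extra tensor mode and the correspondingly larger decompositions.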
It is worth noting that in time-demanding applications, the method can be further sped up by processing every nth frame of the stream. Such a strategy is used in some of the reported methods. However, in the presented framework, we did not use this option, and all tests were run with no frame decimation.
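The frame-decimation option mentioned above can be sketched as follows; this is an illustrative generator, not the authors' implementation, and the frames here are dummy indices.

```python
def decimate(stream, n):
    """Yield every nth frame; n = 1 means no decimation, as used in the tests."""
    for i, frame in enumerate(stream):
        if i % n == 0:
            yield frame

frames = list(range(10))          # dummy frame indices standing in for video frames
print(list(decimate(frames, 3)))  # [0, 3, 6, 9]
```

Decimation trades temporal resolution for speed: very short shots that fall entirely between sampled frames can be missed, which is presumably why all reported tests ran with n = 1.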
Conclusions
In this paper, a tensor-based method for video stream analysis and concept drift detection was presented. The method builds a tensor model from a number of frames of the input stream. The model is then continuously updated by the incoming frames as long as they fit the model according to the proposed fitness function. However, if the fitness function indicates a high deviation from the model for a number of consecutive frames, the model is rebuilt starting from the current position. We proposed an efficient model construction method based on the best rank-(R_1, R_2, …, R_P) tensor decomposition for processing streams of tensor data.
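The update/rebuild loop summarized above can be sketched schematically. The names fitness, build_model, and update_model are placeholders: the paper's fitness function (30) and the best rank-(R_1, R_2, …, R_P) decomposition are not reproduced here, and the toy run below uses a scalar running mean purely for illustration.

```python
def process_stream(frames, fitness, build_model, update_model,
                   threshold, patience):
    """Update the model while frames fit; after `patience` consecutive
    misfits, record a shot boundary and rebuild from the current frame."""
    model = build_model(frames[:1])
    misses, shots = 0, []
    for i, frame in enumerate(frames[1:], start=1):
        if fitness(model, frame) >= threshold:
            model = update_model(model, frame)   # frame fits: adapt the model
            misses = 0
        else:
            misses += 1
            if misses >= patience:               # sustained drift: new shot
                shots.append(i)
                model = build_model([frame])
                misses = 0
    return shots

# Toy run: the "model" is a running mean of scalar frames.
fit = lambda m, f: 1.0 - abs(m - f)
build = lambda fs: float(fs[0])
upd = lambda m, f: 0.9 * m + 0.1 * f
signal = [0.0] * 20 + [5.0] * 20   # abrupt content change at index 20
print(process_stream(signal, fit, build, upd, threshold=0.5, patience=2))
```

The patience parameter mirrors the requirement that several consecutive frames deviate before a rebuild is triggered, which makes the detector robust to single-frame outliers but delays the reported boundary by a few frames.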
The method was built and tested in the framework of video stream processing for the detection of video shots. The main benefit of the tensor representation and decomposition is their natural ability to account for the many internal dimensions of the frames constituting a video stream, as well as the ability to operate on original signals with no need for feature detection. This can be beneficial for multidimensional signals or measurements arriving in a stream whose statistical properties are not known a priori. Nevertheless, features can be easily incorporated as additional dimensions of the input tensors. However, each additional dimension incurs a polynomial growth of the computational complexity, as well as of the memory consumption. The analysis of tensors joining signals and/or additional signal features is left for further research.
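How extra signal dimensions become additional tensor modes can be illustrated with array shapes. The 240 × 352 resolution matches the test videos; the 8-channel feature map in the last line is a purely hypothetical example of appending features as a new mode, as discussed above.

```python
import numpy as np

h, w = 240, 352
mono = np.zeros((h, w))            # monochrome frame: order-2 tensor
color = np.zeros((h, w, 3))        # color adds a small mode of size 3
features = np.zeros((h, w, 3, 8))  # hypothetical: 8 feature channels as a 4th mode

for t in (mono, color, features):
    print(t.ndim, t.shape)
```

Each new mode multiplies the element count (here by 3, then by 8), which is the polynomial growth in computation and memory noted above.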
The method was adjusted to work with color and monochrome videos by designing a specific concept drift detector. It was then tested on the Open Video database with monochrome and color versions of the test videos. For both types, the achieved accuracy compares favorably with the majority of the video key-frame detection methods, which are typically designed to process color videos exclusively. Hence, the proposed tensor method can work with any type of video signal, such as monochrome, color, hyperspectral, etc., with no a priori assumptions on the statistical properties of the signal. On the other hand, we noticed relatively poor recognition of certain types of soft temporal signal changes. This happens because, if a change is not sufficiently abrupt to cause a model rebuild, the adaptively progressing model gets gradually adapted to the slow/soft signal changes. However, these can be monitored by, e.g., a higher standard deviation of the fit measure, although its level depends on the signal type. In addition, a better concept drift detection method can be designed, which is also left for future research.
Nevertheless, the method requires significant memory and computational resources. To improve its performance, a fast method of dominant tensor subspace construction was proposed. Thanks to this, a significant speedup was obtained compared to the standard eigenproblem solution based on the SVD. The tensor subspace method was further modified, as proposed in this paper, to increase its numerical stability and to allow for automatic determination of the dominating eigenvalues.
Further research will be oriented toward improving the accuracy and speed of the proposed method, possibly by developing more efficient model fitness functions, as well as by developing parallel tensor decomposition methods.
Acknowledgements
This work was supported by the Polish National Science Center NCN under the Grant No. 2014/15/B/ST6/00609. This work was supported by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.
References
 1. Asghar, M.N., Hussain, F., Manton, R.: Video indexing: a survey. Int. J. Comput. Inf. Technol. 03(01), 148–169 (2014)
 2. de Avila, S.E.F., Lopes, A.P.B., da Luz Jr., A., Araújo, A.A.: VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn. Lett. 32, 56–68 (2011)
 3. Bader, B.W., Kolda, T.G.: Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Math. Softw. 32(4), 635–653 (2006)
 4. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton (1961)
 5. Cichocki, A., Zdunek, R., Amari, S.: Nonnegative matrix and tensor factorization. IEEE Signal Process. Mag. 25(1), 142–145 (2008)
 6. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.I.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, Hoboken (2009)
 7. Cayllahua-Cahuina, E.J.Y., Cámara-Chávez, G., Menotti, D.: A static video summarization approach with automatic shot detection using color histograms. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 1–12 (2012)
 8. Cyganek, B., Krawczyk, B., Woźniak, M.: Multidimensional data classification with chordal distance based kernel and support vector machines. Eng. Appl. Artif. Intell. 46(A), 10–22 (2015)
 9. Cyganek, B.: An analysis of the road signs classification based on the higher-order singular value decomposition of the deformable pattern tensors. In: Advanced Concepts for Intelligent Vision Systems ACIVS 2010, LNCS 6475, pp. 191–202. Springer, Berlin (2010)
 10. Cyganek, B.: Object Detection and Recognition in Digital Images: Theory and Practice. Wiley, Hoboken (2013)
 11. Cyganek, B.: Object recognition with the higher-order singular value decomposition of the multidimensional prototype tensors. In: 3rd Computer Science On-line Conference (CSOC 2014), Advances in Intelligent Systems and Computing, pp. 395–405. Springer, Berlin (2014)
 12. Cyganek, B., Woźniak, M.: On robust computation of tensor classifiers based on the higher-order singular value decomposition. In: The 5th Computer Science On-line Conference on Software Engineering Perspectives and Application in Intelligent Systems 2016 (CSOC2016), Advances in Intelligent Systems and Computing, vol. 465, pp. 193–201. Springer, Berlin (2016)
 13. DeRecLib, http://www.wiley.com/go/cyganekobject. Accessed 29 July 2017
 14. Del Fabro, M., Böszörmenyi, L.: State-of-the-art and future challenges in video scene detection: a survey. Multimed. Syst. 19(5), 427–454 (2013)
 15. Fu, Y., Guo, Y., Zhu, Y., Liu, F., Song, C., Zhou, Z.H.: Multi-view video summarization. IEEE Trans. Multimed. 12(7), 717–729 (2010)
 16. Furini, M., Geraci, F., Montangero, M., Pellegrini, M.: STIMO: STIll and MOving video storyboard for the web scenario. Multimed. Tools Appl. 46(1), 47–69 (2010)
 17. Gama, J.: Knowledge Discovery from Data Streams. CRC Press, Boca Raton (2010)
 18. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 44:1–44:37 (2014)
 19. Gao, Y., Wang, W.B., Yong, J.H., Gu, H.J.: Dynamic video summarization using two-level redundancy detection. Multimed. Tools Appl., pp. 233–250 (2009)
 20. Guan, G., Wang, Z., Yu, K., Mei, S., He, M., Feng, D.: Video summarization with global and local features. In: Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops, pp. 570–575 (2012)
 21. The Open Video Project, https://openvideo.org/. Accessed 29 July 2017
 22. VSUMM, https://sites.google.com/site/vsummsite/home. Accessed 29 July 2017
 23. VSCAN, https://sites.google.com/site/vscansite/home. Accessed 29 July 2017
 24. Kay, D.: Schaum's Outline of Tensor Calculus. McGraw-Hill, New York (1988)
 25. Kiers, H.A.L.: Towards a standardized notation and terminology in multiway analysis. J. Chemom. 14, 105–122 (2000)
 26. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
 27. Kolda, T.G.: Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 23(1), 243–255 (2001)
 28. Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132–156 (2017)
 29. Kuanar, S.K.: Video key frame extraction through dynamic Delaunay clustering with a structural constraint. J. Vis. Commun. Image Represent. 24(7), 1212–1227 (2013)
 30. de Lathauwer, L.: Signal processing based on multilinear algebra. Ph.D. dissertation, Katholieke Universiteit Leuven (1997)
 31. de Lathauwer, L., de Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000)
 32. de Lathauwer, L., de Moor, B., Vandewalle, J.: On the best rank-1 and rank-(R_1, R_2, …, R_N) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21(4), 1324–1342 (2000)
 33. Lee, H., Yu, J., Im, Y., Gil, J.M., Park, D.: A unified scheme of shot boundary detection and anchor shot detection in news video story parsing. Multimed. Tools Appl. 51, 1127–1145 (2011)
 34. Li, Y.: On incremental and robust subspace learning. Pattern Recogn. 37, 1509–1518 (2004)
 35. Mahmoud, K.A., Ismail, M.A., Ghanem, N.M.: VSCAN: an enhanced video summarization using density-based spatial clustering. In: Image Analysis and Processing – ICIAP 2013, vol. 1, LNCS, pp. 733–742. Springer, Berlin (2013)
 36. Medentzidou, P., Kotropoulos, C.: Video summarization based on shot boundary detection with penalized contrasts. In: IEEE 9th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 199–203 (2015)
 37. DeMenthon, D., Kobla, V., Doermann, D.: Video summarization by curve simplification. In: Proceedings of the Sixth ACM International Conference on Multimedia, pp. 211–218 (1998)
 38. Mundur, P., Rao, Y., Yesha, Y.: Keyframe-based video summarization using Delaunay clustering. Int. J. Digit. Libr. 6(2), 219–232 (2006)
 39. Muti, D., Bourennane, S.: Survey on tensor signal algebraic filtering. Signal Process. 87, 237–249 (2007)
 40. Ou, S.H., Lee, C.H., Somayazulu, V.S., Chen, Y.K., Chien, S.Y.: On-line multi-view video summarization for wireless video sensor network. IEEE J. Sel. Top. Signal Process. 9(1), 165–179 (2015)
 41. Porwik, P., Orczyk, T., Lewandowski, M., et al.: Feature projection k-NN classifier model for imbalanced and incomplete medical data. Biocybern. Biomed. Eng. 36(4), 644–656 (2016)
 42. Savas, B., Eldén, L.: Handwritten digit classification using higher order singular value decomposition. Pattern Recogn. 40, 993–1003 (2007)
 43. Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: dynamic tensor analysis. In: KDD '06, Philadelphia, PA, USA (2006)
 44. Sun, J., Tao, D., Faloutsos, C.: Incremental tensor analysis: theory and applications. ACM Trans. Knowl. Discov. Data 2(3), 11:1–11:37 (2008)
 45. Truong, B.T., Venkatesh, S.: Video abstraction: a systematic review and classification. ACM Trans. Multimed. Comput. Commun. Appl. 3(1), 1–37 (2007)
 46. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966)
 47. Valdes, V., Martinez, J.: Efficient video summarization and retrieval tools. In: International Workshop on Content-Based Multimedia Indexing, pp. 43–48 (2011)
 48. Vasilescu, M.A., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Proceedings of the European Conference on Computer Vision, pp. 447–460 (2002)
 49. Vasilescu, M.A., Terzopoulos, D.: Multilinear independent component analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 547–553 (2005)
 50. Vasilescu, M.A., Terzopoulos, D.: Multilinear (tensor) image synthesis, analysis, and recognition. IEEE Signal Process. Mag., pp. 118–123 (2007)
 51. Wang, H., Ahuja, N.: Compact representation of multidimensional data using tensor rank-one decomposition. In: Proceedings of the 17th International Conference on Pattern Recognition, vol. 1, pp. 44–47 (2004)
 52. Wang, H., Ahuja, N.: A tensor approximation approach to dimensionality reduction. Int. J. Comput. Vis. 76(3), 217–229 (2008)
 53. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
 54. Wu, Z., Xie, W., Yu, J.: Fuzzy C-means clustering algorithm based on kernel method. In: Fifth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA '03), pp. 1–6 (2003)
 55. Zimek, A., Schubert, E., Kriegel, H.P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.