1 Introduction

Videos are generally regarded as unbiased and reliable records of events, and they are widely used as basic evidence in many fields. However, with the rapid development of digital media editing and inpainting techniques, it has become easier for forgers to alter the facts by removing an undesired target from a video. Compared with schemes that directly delete the video frames containing the target, local removal of an object does not destroy the continuity of other moving objects in the same frames. Tampered videos spread through the Internet can disrupt people's daily lives and even interfere with the normal social order.

Over the past few years, several video forensic methods for object removal have been proposed. Hsu \(et\ al.\) [1] proposed an approach for detecting and locating forged regions using block-based correlation of noise residue. It is based on the observation that the temporal correlation of noise residue in forged regions of a frame differs significantly from that in normal regions. However, the noise correlation is unstable when the test videos suffer from abrupt illumination variations, and it is sensitive to quantization noise. Noise-residue correlation was also used to locate forgeries in [2,3,4]. Singh \(et\ al.\) [5] proposed a sensor-pattern-noise based detection scheme, an improved and forensically stronger version of the noise-residue based technique. Wang \(et\ al.\) [6] developed a technique to uncover copy-paste forgeries in de-interlaced and interlaced videos using correlation coefficients; for de-interlaced video, tampering destroys the correlations introduced by the de-interlacing algorithm. Bestagini \(et\ al.\) [7] proposed a similar approach to locate forgeries in the spatio-temporal domain. Zhang \(et\ al.\) [8] detected video forgery based on the ghost shadow artifact that is usually introduced when objects are removed by video inpainting; however, this method cannot accurately locate the forged regions and is vulnerable to noise. Li \(et\ al.\) [9] uncovered object removal in surveillance videos with stationary background using motion vector correlation analysis, based on the observation that the distributions of motion vectors in the foreground area of authentic and forged videos are quite different. Lin \(et\ al.\) [10] analyzed abnormalities in the spatio-temporal coherence between successive frames to detect and locate forged regions, but this approach only works well on uncompressed forged videos.

Inpainting techniques are used to fill the missing holes in a visually reasonable manner when unwanted objects are removed from the video. Temporal copy-and-paste (TCP) and exemplar-based texture synthesis (ETS) are two typical inpainting methods. The TCP method replaces the forged region with the most coherent area from the nearest frame, which leads to unnaturally high temporal coherence in the forged area. The ETS inpainting method proposed in [11] individually fills in the regions from sample textures for each frame, which leads to abnormally low temporal coherence in the forged region.

This paper aims to address the problem of detecting and locating forged regions based on coherence analysis of spatio-temporal local binary patterns (LBP). LBP is a popular operator for describing the spatial structure of image texture, and it is unaffected by illumination variations because of its invariance to monotonic gray level changes. It is also robust to video compression, since LBP describes the distribution of regional gray levels and compression does not change this relationship significantly. In view of its simplicity and effectiveness in image representation and classification, LBP and its variants have been applied in many research fields, such as facial image analysis and digital image/video forensics [12, 13]. The major procedures of the proposed algorithm are as follows: (i) the motion vector (MV) of the background of each frame is computed to align the video frames, which preprocesses video captured by a moving camera; (ii) coherence analysis on the spatial LBP operator between two adjacent frames is performed to find possible forged regions; (iii) the temporal LBP is utilized to remove false positives, and the final abnormal region is located. Our method can be applied to videos taken by moving cameras, and the experimental results show that it is effective and relatively robust in detecting regions manipulated by well-known inpainting methods such as TCP and ETS.

The remainder of this paper is organized as follows. Section 2 gives the details of the proposed video forgery detection scheme based on spatio-temporal LBP coherence analysis. The experimental results are presented in Sect. 3. Finally, Sect. 4 summarizes the highlights and discusses the future work.

Fig. 1. Flowchart of our proposed method.

2 Proposed Method

The proposed method aims to expose the traces of object removal forgery in static and dynamic scene videos. Our idea is to detect regions manipulated by TCP and ETS by finding regions with abnormal temporal correlation, because video subjected to such forgery exhibits unnaturally high or low correlation between regions of successive video frames. As shown in Fig. 1, the proposed detection scheme consists of three major steps: (i) frames alignment, (ii) spatial LBP (S-LBP) based forged regions detection, and (iii) temporal LBP (T-LBP) based false positives removal. The details of the proposed method are given in the following subsections.

2.1 Frames Alignment

In order to handle video motion caused by camera movement or shaking of a mobile phone, we adopt a simple block-matching motion estimation algorithm to obtain the motion vector of each video frame and thereby align the frames. In this paper, we use \(\{F_{1},F_{2},\cdots ,F_{L}\}\) to represent a video sequence V of length L, \(L\in \mathbb {Z}^{+}\), where \(F_{t}\) represents the \(t^{th}\) frame and \(\mathbb {Z}^{+}\) is the set of positive integers. The background motion vector \((Vx_{t}, Vy_{t})\) of the \(t^{th}\) frame is taken as that frame's motion vector.

For computational efficiency, we first convert the video sequence from three-dimensional color space to two-dimensional grayscale space. Each frame is then divided into b non-overlapping blocks of \(M\times N\) pixels, and the motion vector of the \(i^{th}\) block of the \(t^{th}\) frame is denoted by \((vx^{i}_{t}, vy^{i}_{t})\). We use the exhaustive search (ES) algorithm with the mean absolute deviation (MAD) matching criterion to find the most similar block between successive frames \(F_{t-1}\) and \(F_{t}\), and thus obtain the motion vector of each block. In typical applications, the area of foreground regions is usually much smaller than that of the background region. Based on this assumption, we choose the most frequent \((vx^{i}_{t}, vy^{i}_{t})\) as the background motion of \(F_{t}\) as follows:

$$\begin{aligned} Vx_{t}=mode\{vx^{1}_{t},vx^{2}_{t},\cdots ,vx^{b}_{t}\} \end{aligned}$$
(1)
$$\begin{aligned} Vy_{t}=mode\{vy^{1}_{t},vy^{2}_{t},\cdots ,vy^{b}_{t}\} \end{aligned}$$
(2)

where \(mode(\cdot ,\cdot ,\cdots ,\cdot )\) returns the most frequent value among its arguments. After the motion vector of each frame is obtained, the pixels in frame \(F_t\) are shifted by the cumulative vector \((Cx_{t}, Cy_{t})\) of the motion vectors of all frames up to and including \(F_t\). \(Cx_{t}\) and \(Cy_{t}\) are calculated as follows:

$$\begin{aligned} Cx_{t}=\sum _{j=1}^{t}Vx_{j} \end{aligned}$$
(3)
$$\begin{aligned} Cy_{t}=\sum _{j=1}^{t}Vy_{j} \end{aligned}$$
(4)
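The alignment step above can be sketched in a few lines of NumPy. The \(16\times 16\) block size and the \(\pm 7\)-pixel search window are illustrative assumptions (the paper only specifies \(M\times N\) blocks), and the hypothetical `background_motion` function plays the role of Eqs. (1)-(2):

```python
import numpy as np
from collections import Counter

def background_motion(prev, curr, block=16, search=7):
    """Estimate the dominant (background) motion vector between two
    grayscale frames via exhaustive-search block matching with the
    MAD criterion; the mode of the block vectors is returned, as in
    Eqs. (1)-(2)."""
    H, W = prev.shape
    votes = Counter()
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            ref = curr[y:y + block, x:x + block].astype(np.int32)
            best, best_mv = np.inf, (0, 0)
            # exhaustive search over the (2*search+1)^2 candidate offsets
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue
                    cand = prev[yy:yy + block, xx:xx + block].astype(np.int32)
                    mad = np.abs(ref - cand).mean()
                    if mad < best:
                        best, best_mv = mad, (dx, dy)
            votes[best_mv] += 1
    # mode: the most frequent block motion vector is taken as background motion
    return votes.most_common(1)[0][0]
```

The cumulative shift of Eqs. (3)-(4) then follows by summing these per-frame vectors over t.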

2.2 Spatial LBP Based Forged Regions Detection

The spatial LBP (S-LBP) based forged regions detection is performed on the aligned frames. In this section, S-LBP is defined in a \(3\times 3\) window as shown in Fig. 2. We take the center pixel of the window as the threshold and compare the gray values of its 8 adjacent pixels with it: if a surrounding pixel value is greater than or equal to the center value, the binary code of the corresponding position is 1, and otherwise 0. Finally, the S-LBP coded frame SL of each original video frame is obtained. The definition of SL is given by Eqs. (5) and (6).

Fig. 2. The computation process of LBP.

$$\begin{aligned} SL(x_{c},y_{c})=\sum _{p=0}^{P-1} s(g_{p}-g_{c})2^{p} \end{aligned}$$
(5)
$$\begin{aligned} s(x)= {\left\{ \begin{array}{ll} 1, \quad \quad x \ge 0\\ 0, \quad otherwise \end{array}\right. } \end{aligned}$$
(6)

where \((x_{c},y_{c})\) are the coordinates of the center pixel, p is the index of the sampling point around \((x_{c},y_{c})\), and \(g_{c}\) and \(g_{p}\) are the gray values of the center pixel and its \(p^{th}\) neighbor, respectively. P is the number of sampling points around the center pixel, which is set to 8 here.
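A minimal NumPy sketch of the S-LBP coding of Eqs. (5)-(6) follows; the clockwise ordering of the 8 neighbors is an assumption, since only the weights \(2^{p}\) are fixed by the definition:

```python
import numpy as np

def slbp(frame):
    """Compute the S-LBP code of Eq. (5) for every interior pixel of a
    grayscale frame, using the 8-neighborhood (P = 8)."""
    f = frame.astype(np.int32)
    c = f[1:-1, 1:-1]                      # center pixels g_c
    # 8 neighbors, clockwise from top-left (ordering is an assumption)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offsets):
        # shifted view of the frame aligned with the center grid
        g_p = f[1 + dy:f.shape[0] - 1 + dy, 1 + dx:f.shape[1] - 1 + dx]
        code += ((g_p - c) >= 0).astype(np.int32) << p   # s(g_p - g_c) * 2^p
    return code.astype(np.uint8)
```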

To analyze the correlation between the previous frame \(F_{t-1}\) and the current frame \(F_{t}\), we first calculate the frame difference \(S_{d}\) of the two adjacent LBP frames. \(S_{d}\) is then divided into non-overlapping blocks, and the number of zeros in the histogram vector of each block is counted. If a block is forged, the number of zeros varies substantially: it increases under TCP and decreases under ETS. Figure 3 shows the average distribution of histograms of block-level (\(8\times 8\) block) differences between every two consecutive LBP frames in three different cases. Note that the ordinates of the three figures are different. Clearly, the numbers of zeros and the histogram distributions in forged regions differ significantly from those of original areas. As a result, forged and non-forged regions can be distinguished by analyzing the number of zeros Q in the histogram vector of each block of \(S_{d}\). The preliminary classification is defined as follows:

$$\begin{aligned} Class_{i}= {\left\{ \begin{array}{ll} 0, \quad T_{1}<Q<T_{2}\\ 1, \quad otherwise \end{array}\right. } \end{aligned}$$
(7)

where \(Class_{i}\) denotes the binary classification mask of the \(i^{th}\) block, and a value of 1 indicates that the block has been forged. \(T_{1}\) and \(T_{2}\) are thresholds separating forged regions from normal ones. Finally, the pre-classification mask image of each original video frame is obtained by combining these block-level binary masks. Since large smooth areas such as sky can also lead to abnormally high correlation between two adjacent frames and interfere with the detection result, we describe a scheme for removing such false positive areas in the next section.
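The pre-classification of Eq. (7) amounts to counting, per block, the pixels where the two LBP frames agree. A sketch, assuming the \(8\times 8\) blocks and the thresholds \(T_{1}=18\), \(T_{2}=63\) reported in Sect. 3:

```python
import numpy as np

def classify_blocks(sl_prev, sl_curr, block=8, T1=18, T2=63):
    """Pre-classify each block of the LBP frame difference S_d using
    Eq. (7): a block is flagged as forged (1) when the number of zeros Q
    in its difference histogram falls outside the interval (T1, T2)."""
    sd = sl_curr.astype(np.int32) - sl_prev.astype(np.int32)
    H, W = sd.shape
    mask = np.zeros((H // block, W // block), dtype=np.uint8)
    for by in range(H // block):
        for bx in range(W // block):
            blk = sd[by * block:(by + 1) * block,
                     bx * block:(bx + 1) * block]
            # Q = histogram bin at 0, i.e. pixels identical in both frames
            Q = int(np.count_nonzero(blk == 0))
            mask[by, bx] = 0 if T1 < Q < T2 else 1
    return mask
```

A fully identical block gives Q = 64 (TCP-like, flagged), while very few zeros (ETS-like) is flagged as well.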

Fig. 3. The comparison of the histograms of block-level differences between every two consecutive LBP frames in three different cases: (a) normal region, (b) the block inpainted by TCP, and (c) the block inpainted by ETS. The abscissa represents the pixel difference between adjacent LBP frames (ranging from −255 to 255), and the ordinate represents the number of pixels in the \(8\times 8\) block of \(S_{d}\) taking each difference value.

2.3 Temporal LBP Based False Positives Removal

In this section, the temporal LBP (T-LBP) operator, extended from the spatial domain, is utilized to remove false positives. The value of each pixel in the aligned frames obtained in Sect. 2.1 is computed by weighting the symmetric pixel pairs within a range of 8 adjacent frames in the temporal domain. That is, each T-LBP coded frame \(TL_{t}\) carries information from the 8 frames on either side (16 frames in total). The mathematical definition of \(TL_{t}\) is as follows:

$$\begin{aligned} TL_{t}(x,y)=\sum _{r=1}^{R} s(G_{t-r}(x,y)-G_{t+r}(x,y))2^{r-1} \end{aligned}$$
(8)

where R is the neighborhood radius in the temporal domain, set to 8 here, and \(G_{t}(x,y)\) represents the gray value of the pixel at coordinates (x, y) in the \(t^{th}\) aligned frame. Figure 4 shows the pixel pairs and their weights in the computation of \(TL_{t}(x,y)\). Thus a TL sequence of length \(L-16\), consisting of LBP-coded frames with the same size as the original video frames, is obtained.
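Equation (8) can be sketched directly, treating `frames` as the list of aligned grayscale frames; note that the index t must satisfy \(R \le t < L-R\) for the symmetric pairs to exist:

```python
import numpy as np

def tlbp(frames, t, R=8):
    """Compute the T-LBP frame TL_t of Eq. (8): each pixel is coded from
    the R symmetric frame pairs (t-r, t+r) around frame t."""
    code = np.zeros_like(frames[t], dtype=np.int32)
    for r in range(1, R + 1):
        diff = frames[t - r].astype(np.int32) - frames[t + r].astype(np.int32)
        code += (diff >= 0).astype(np.int32) << (r - 1)   # s(.) * 2^(r-1)
    return code.astype(np.uint8)
```

For a temporally stable pixel the differences are all zero, so every bit is set and the code saturates at 255, which is what makes stable smooth regions easy to isolate.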

Fig. 4. Pixel pairs and their weights in the process of calculating \(TL_{t}(x,y)\).

Large smooth areas that cause false alarms are found by extracting the regions that remain stable over a period of time in the TL sequence. Specifically, similar to the previous section, we first calculate the frame difference between each frame \(TL_{t}\) and the first LBP-coded frame to obtain a difference sequence of length \(L-17\). Each difference frame is then divided into non-overlapping blocks, and the number of zeros \(Q'\) in the histogram vector of each block is counted. Finally, we convert each difference frame to a binary image by thresholding as follows:

$$\begin{aligned} Class'_{i}= {\left\{ \begin{array}{ll} 1, \quad Q'>T_{3}\\ 0, \quad otherwise \end{array}\right. } \end{aligned}$$
(9)

where \(Class'_{i}\) denotes the binary classification mask of the \(i^{th}\) block, and a value of 1 indicates that the block belongs to a large smooth area. The binary mask image of each difference frame is obtained by combining these block-level binary masks, and the mask image of the smooth region is obtained after an OR operation and mathematical morphological processing as follows:

$$\begin{aligned} smooth=((B_{1}\bigcup B_{2} \bigcup \cdots \bigcup B_{L-17}) \oplus E)\ominus E \end{aligned}$$
(10)

where \(B_{t}\) is the thresholded binary image of each difference frame, \(\bigcup \) is the logical OR operator, and E is a structuring element. \(\oplus \) and \(\ominus \) denote morphological dilation and erosion respectively, so Eq. (10) applies a morphological closing to the union. Finally, the false positives removal operation is performed on each pre-classification mask image of Sect. 2.2 according to the binary mask image of the smooth region, and the final binary classification image and the localization result are obtained, as shown in Fig. 5.
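Equation (10) can be sketched as follows, assuming a \(3\times 3\) square structuring element E (the paper does not specify E); the dilation and erosion are written out by hand so the sketch stays self-contained:

```python
import numpy as np

def dilate3(m):
    """Binary dilation with a 3x3 square structuring element (zero border)."""
    p = np.pad(m, 1)
    out = np.zeros_like(m)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
    return out

def erode3(m):
    """Binary erosion with a 3x3 square structuring element (zero border)."""
    p = np.pad(m, 1)
    out = np.ones_like(m)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
    return out

def smooth_mask(binary_frames):
    """Eq. (10): OR all per-frame binary masks B_t, then apply a
    morphological closing (dilation followed by erosion)."""
    union = np.zeros_like(binary_frames[0])
    for b in binary_frames:
        union |= b
    return erode3(dilate3(union))
```

The closing fills small holes inside a flagged smooth region, so the mask covers the area contiguously before it is used to suppress false positives.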

Fig. 5. The process of locating the forged region in a frame: (a) pre-classification mask image, (b) mask image of the smooth area in video, (c) final binary classification image, and (d) localization result.

3 Experimental Results

To evaluate the performance of our method, twenty test video sequences were prepared for the experiments. We classify these videos into three groups according to their sources and the state of the video background: group I contains 7 test videos with still background obtained from the SULFA data set [14], with a frame resolution of \(320\times 240\) pixels. Group II contains 8 test videos that we captured with a static camera, and group III contains the remaining 5 videos with dynamic background, also captured by ourselves. The frame resolution in groups II and III is \(352\times 288\), and the frame rate of all test videos is 30 fps. All the videos were forged by the TCP and ETS inpainting methods respectively, and then re-encoded to H.264/AVC (with bitrates in the range of 1 Mbps to 5 Mbps) after the forgery. Based on extensive experiments, \(T_{1}\), \(T_{2}\) and \(T_{3}\) are empirically set to 18, 63 and 30 for \(8\times 8\) blocks.

As shown in Table 1, the detection performance is measured by precision rate P, recall rate R, and F1-score F1, which are calculated as below:

$$\begin{aligned} P=TP/(TP+FP) \end{aligned}$$
(11)
$$\begin{aligned} R=TP/(TP+FN) \end{aligned}$$
(12)
$$\begin{aligned} F1=(2\times P\times R)/(P+R) \end{aligned}$$
(13)

where TP denotes the number of correct detections, FP the number of false positives, and FN the number of misses. Table 1 reports the average results for all videos in each group over 5 different bitrates. The proposed method achieves high precision for both video inpainting attacks, especially the TCP scheme. Performance on ETS-tampered videos with a large amount of dynamic background degrades because frames alignment errors cause authentic regions to be falsely classified as forged. Figure 6 shows screenshots of the original frames, their counterparts forged by the two inpainting schemes, and the corresponding localization results of the proposed method; the red blocks indicate the detected forged regions.
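The metrics of Eqs. (11)-(13) reduce to a few lines; for example, 80 correct detections with 20 false positives and 20 misses give P = R = F1 = 0.8:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1-score of Eqs. (11)-(13) from block-level
    detection counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```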

Table 1. Average performance of the proposed method for videos forged by TCP and ETS.
Fig. 6. Screenshots of the test video sequences: (a), (d) original frames, (b) the inpainted frame forged by TCP, (e) the inpainted frame forged by ETS, and (c), (f) the corresponding detection results. (Color figure online)

In addition, we compare the proposed approach with the existing methods of Hsu \(et\ al.\) [1] and Lin \(et\ al.\) [10]; the comparison results are shown in Table 2. Our approach outperforms the other two algorithms, achieving higher performance especially under the ETS inpainting attack. The noise-residue based method [1] has no mechanism to remove false positive regions, so its performance on group II, which contains large smooth areas, decreases markedly. Since it is not applicable to dynamic backgrounds, we do not report its results on group III. The performance of [10] drops significantly for videos forged by ETS compared with TCP, because that method relies heavily on edge detection of the forged region, and it is difficult to accurately extract region boundaries forged by ETS. The LBP operator and its variants used in our method are unaffected by illumination changes owing to their invariance to monotonic gray level changes. In addition, the frames alignment operation enables detection in video captured by a moving camera.

Table 2. Comparison results between our method and two existing schemes presented by Hsu \(et\ al.\) [1] and Lin \(et\ al.\) [10].

An in-depth analysis of the literature reveals that the primary factors affecting the performance of inpainting detection techniques are the bitrate and compression quality of the test videos. Therefore, we present the forgery detection capability of the three forensic schemes for video sequences with bitrates from 1 Mbps to 5 Mbps, as shown in Fig. 7. The spatio-temporal LBP based approach retains a high precision rate as the bitrate decreases. This is because the LBP operator and its variants describe the distribution of regional gray levels, which does not change significantly during compression.

Fig. 7. Comparison of detection precision under different bitrate settings: (a) forged by the TCP inpainting scheme, and (b) forged by the ETS inpainting scheme.

4 Conclusion

In this paper, we have presented a detection and localization method for video object removal forgery based on spatio-temporal LBP coherence analysis. We first perform frames alignment to handle camera motion. We then apply the spatial LBP operator for coherence analysis to find possible abnormal areas. Finally, the temporal LBP operator is utilized to remove authentic regions falsely classified as forged and to locate the final forged areas. In our experiments, two video inpainting schemes (TCP and ETS) simulate two different types of tampering for performance evaluation. The experimental results show that our method detects and locates forged regions effectively and remains stable as the bitrate decreases. It can also be applied to videos taken by mobile cameras or handheld phones. However, severe shaking or even slight rotation of the forged video leads to unsatisfactory results, mainly because the coherence analysis breaks down when the difference between two normal frames becomes very large. In future work, we will explore ways to address these problems and broaden the method's applicability.