Abstract
Facial manipulation techniques have raised increasing security concerns, prompting various methods for detecting forged videos. However, existing methods suffer a significant performance gap compared with image manipulation detection methods, partly because spatio-temporal information is not well exploited. To address this issue, we introduce a Hybrid Spatio-Temporal Network (HSTNet) that integrates spatial and temporal information in a single framework. Specifically, HSTNet adopts a hybrid architecture, consisting of a 3D CNN branch and a transformer branch, to jointly learn short- and long-range relations along the spatio-temporal dimension. Because the features of the two branches are misaligned, we design a Feature Alignment Block (FAB) to recalibrate and efficiently fuse the heterogeneous features. Moreover, HSTNet introduces a Vector Selection Block (VSB) that combines the outputs of the two branches and emphasizes the features most important for classification. Extensive experiments show that HSTNet achieves the best overall performance among state-of-the-art methods.
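To illustrate the two fusion steps the abstract describes, here is a minimal NumPy sketch of the general idea, not the authors' implementation: an FAB-style step projects transformer tokens into the 3D CNN feature layout and applies a channel-wise gate before adding them to the CNN features, and a VSB-style step takes a softmax-weighted sum of the two branch vectors. All function names, shapes, and the specific gating scheme are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_alignment_block(cnn_feat, trans_tokens, proj):
    """Align transformer tokens with a 3D CNN feature map and fuse them.

    cnn_feat:     (C, T, H, W) features from the 3D CNN branch
    trans_tokens: (T*H*W, D) token features from the transformer branch
    proj:         (D, C) learned projection from token dim to channel dim
    """
    C, T, H, W = cnn_feat.shape
    # Project token dimension D -> C, then reshape to the CNN layout.
    aligned = (trans_tokens @ proj).T.reshape(C, T, H, W)
    # Channel-wise sigmoid gate from global statistics of the aligned map,
    # recalibrating the transformer features before residual fusion.
    gate = 1.0 / (1.0 + np.exp(-aligned.mean(axis=(1, 2, 3))))
    return cnn_feat + gate[:, None, None, None] * aligned

def vector_selection_block(v_cnn, v_trans):
    """Softmax-weighted combination of the two branch output vectors."""
    scores = softmax(np.array([v_cnn.mean(), v_trans.mean()]))
    return scores[0] * v_cnn + scores[1] * v_trans

# Toy usage with random features.
rng = np.random.default_rng(0)
cnn = rng.standard_normal((8, 4, 2, 2))     # C=8, T=4, H=W=2
tokens = rng.standard_normal((16, 12))      # T*H*W=16 tokens, D=12
proj = rng.standard_normal((12, 8))
fused = feature_alignment_block(cnn, tokens, proj)
logit_vec = vector_selection_block(rng.standard_normal(8),
                                   rng.standard_normal(8))
```

The residual form keeps the CNN branch dominant while letting gated transformer context flow in, which is one common way to fuse heterogeneous local and global features.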
Acknowledgements
This work was supported by "One Thousand Plan" projects in Jiangxi Province Jxsg2023102268 and National Key Laboratory on Automatic Target Recognition 220402.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Liu, X. et al. (2023). Hybrid Spatio-Temporal Network for Face Forgery Detection. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14408. Springer, Cham. https://doi.org/10.1007/978-3-031-47665-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47664-8
Online ISBN: 978-3-031-47665-5
eBook Packages: Computer Science (R0)