Abstract
In this paper, we present a temporal capsule network architecture that encodes motion in videos as an instantiation parameter. The extracted motion is used to perform motion-compensated error concealment. We modify the original capsule architecture and use a carefully curated dataset to enable the training of capsules both spatially and temporally. First, we add the temporal dimension by taking co-located "patches" from three consecutive frames of standard video sequences to form input data "cubes." Second, the network is designed with an initial feature extraction layer that operates on all three dimensions to generate spatiotemporal features. Additionally, we implement the PrimaryCaps module with a recurrent layer, instead of a conventional convolutional layer, to extract short-term motion-related temporal dependencies and encode them as activation vectors in the capsule output. Finally, the capsule output is combined with the most recent past frame and passed through a fully connected reconstruction network to perform motion-compensated error concealment. We study the effectiveness of temporal capsules by comparing the proposed model with architectures that do not include capsules. Although the quality of the reconstruction shows room for improvement, we successfully demonstrate that capsule-based architectures can be designed to operate in the temporal dimension and encode motion-related attributes as instantiation parameters. The accuracy of motion estimation is evaluated by comparing both the reconstructed frame outputs and the corresponding optical flow estimates with ground truth data.
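The input "cubes" described above can be illustrated with a minimal NumPy sketch: co-located patches are taken from a sliding window of three consecutive grayscale frames and stacked along a third (temporal) dimension. The patch size, stride, and function name here are illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def extract_cubes(frames, patch=16, stride=16):
    """Stack co-located patches from three consecutive frames into
    spatiotemporal input "cubes" of shape (patch, patch, 3).

    frames : array of shape (T, H, W), T >= 3 grayscale frames.
    patch/stride are illustrative choices, not the paper's settings.
    """
    T, H, W = frames.shape
    cubes = []
    for t in range(T - 2):                         # sliding window of 3 frames
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                # co-located patch across frames t, t+1, t+2
                cube = frames[t:t + 3, y:y + patch, x:x + patch]
                cubes.append(np.transpose(cube, (1, 2, 0)))
    return np.stack(cubes)

# 4 frames of 32x32 -> 2 temporal windows x 4 spatial patches = 8 cubes
cubes = extract_cubes(np.zeros((4, 32, 32)), patch=16, stride=16)
```

Each resulting cube is then a natural input to a 3D feature-extraction layer, which produces the spatiotemporal features consumed by the recurrent PrimaryCaps module.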
Sankisa, A., Punjabi, A. & Katsaggelos, A.K. Temporal capsule networks for video motion estimation and error concealment. SIViP 14, 1369–1377 (2020). https://doi.org/10.1007/s11760-020-01671-x