
Temporal capsule networks for video motion estimation and error concealment


In this paper, we present a temporal capsule network architecture that encodes motion in videos as an instantiation parameter. The extracted motion is used to perform motion-compensated error concealment. We modify the original capsule network architecture and use a carefully curated dataset so that the capsules can be trained both spatially and temporally. First, we add the temporal dimension by taking co-located “patches” from three consecutive frames of standard video sequences to form input data “cubes.” Second, the network is designed with an initial feature extraction layer that operates on all three dimensions to generate spatiotemporal features. Third, we implement the PrimaryCaps module with a recurrent layer, instead of a conventional convolutional layer, to extract short-term motion-related temporal dependencies and encode them as activation vectors in the capsule output. Finally, the capsule output is combined with the most recent past frame and passed through a fully connected reconstruction network to perform motion-compensated error concealment. We study the effectiveness of temporal capsules by comparing the proposed model with architectures that do not include capsules. The accuracy of motion estimation is evaluated by comparing both the reconstructed frame outputs and the corresponding optical flow estimates with ground truth data. Although the quality of the reconstruction shows room for improvement, we demonstrate that capsule-based architectures can be designed to operate in the temporal dimension and encode motion-related attributes as instantiation parameters.
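The first step of the abstract, forming input "cubes" from co-located patches in three consecutive frames, can be illustrated with a minimal sketch. This is not the authors' code: the function name `extract_cubes`, the patch size, the stride, and the use of grayscale frames stored as nested lists are all assumptions made for illustration, since the abstract does not state them.

```python
from typing import List, Tuple

Frame = List[List[float]]        # one grayscale frame: rows of pixel values (assumed layout)
Cube = List[List[List[float]]]   # indexed as [time][row][col]


def extract_cubes(frames: List[Frame], patch: int, stride: int) -> List[Tuple[int, int, int, Cube]]:
    """Cut co-located patches from every run of three consecutive frames.

    Returns (t, y, x, cube) tuples, where t is the index of the first of the
    three frames and cube[k] is the patch at location (y, x) in frame t + k.
    Patch size and stride are hypothetical parameters, not values from the paper.
    """
    if len(frames) < 3:
        return []
    height, width = len(frames[0]), len(frames[0][0])
    cubes = []
    for t in range(len(frames) - 2):
        for y in range(0, height - patch + 1, stride):
            for x in range(0, width - patch + 1, stride):
                # Same (y, x) window in three consecutive frames: "co-located".
                cube = [
                    [row[x:x + patch] for row in frames[k][y:y + patch]]
                    for k in (t, t + 1, t + 2)
                ]
                cubes.append((t, y, x, cube))
    return cubes


if __name__ == "__main__":
    # Four synthetic 4x4 frames; each pixel value encodes (frame, row, col).
    video = [[[100 * f + 10 * r + c for c in range(4)] for r in range(4)] for f in range(4)]
    cubes = extract_cubes(video, patch=2, stride=2)
    print(len(cubes), "cubes, each", len(cubes[0][3]), "frames deep")
```

Each cube keeps the temporal axis explicit, which is what would let a feature extraction layer operating on all three dimensions see motion across frames rather than a single still patch.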





Author information

Corresponding author

Correspondence to Arun Sankisa.


Cite this article

Sankisa, A., Punjabi, A. & Katsaggelos, A.K. Temporal capsule networks for video motion estimation and error concealment. SIViP 14, 1369–1377 (2020).



  • Capsule networks
  • Conv3D
  • ConvLSTM
  • Error concealment
  • Motion estimation