
Learning Sequence Representations by Non-local Recurrent Neural Memory

Published in: International Journal of Computer Vision

Abstract

The key challenge of sequence representation learning is to capture long-range temporal dependencies. Typical methods for supervised sequence representation learning are built upon recurrent neural networks (RNNs) to capture temporal dependencies. One potential limitation of these methods is that they explicitly model only one-order information interactions between adjacent time steps in a sequence, so the high-order interactions between non-adjacent time steps are not fully exploited. This greatly limits their ability to model long-range temporal dependencies, since the temporal features learned through one-order interactions cannot be maintained over a long term, owing to temporal information dilution and gradient vanishing. To tackle this limitation, we propose the non-local recurrent neural memory (NRNM) for supervised sequence representation learning, which performs non-local operations by means of a self-attention mechanism to learn full-order interactions within a sliding temporal memory block, and models global interactions between memory blocks in a gated recurrent manner. Consequently, our model is able to capture long-range dependencies. Moreover, the latent high-level features contained in high-order interactions can be distilled by our model. We validate the effectiveness and generalization of NRNM on three types of sequence applications across different modalities: sequence classification, step-wise sequential prediction, and sequence similarity learning. Our model compares favorably against state-of-the-art methods specifically designed for each of these applications.
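The abstract describes two mechanisms: self-attention applied within a sliding memory block to capture full-order interactions, and a gated recurrent update that propagates information between memory blocks. The sketch below illustrates only this high-level structure in NumPy; the function names (`nrnm_step`, `self_attention`), the identity query/key/value projections, and the simplified sigmoid gate are illustrative assumptions, not the authors' implementation (see the official code repository linked in the Notes for that).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k):
    # Single-head scaled dot-product self-attention over the rows of x.
    # Query/key/value projections are omitted (identity) for brevity.
    scores = x @ x.T / np.sqrt(d_k)
    return softmax(scores) @ x

def nrnm_step(memory, hidden_block):
    """One sliding-window memory update (illustrative only).

    memory:       (M, d) memory slots carried across blocks.
    hidden_block: (T, d) RNN hidden states inside the current window.
    Self-attention over the concatenation [memory; hidden states] lets every
    position interact with every other one (full-order interactions); a
    sigmoid gate then merges the attended summary into the old memory,
    standing in for the gated recurrent update between blocks.
    """
    d = memory.shape[1]
    joint = np.concatenate([memory, hidden_block], axis=0)
    attended = self_attention(joint, d)[: memory.shape[0]]  # keep memory slots
    gate = 1.0 / (1.0 + np.exp(-(memory + attended)))       # illustrative gate
    return gate * attended + (1.0 - gate) * memory
```

Sliding the window over a long sequence and feeding the updated memory into the next block is what lets information survive beyond the reach of one-order adjacent-step interactions.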


Notes

  1. https://github.com/F-Frida/NRNM.

  2. https://github.com/pytorch/text.



Acknowledgements

This work was supported in part by the NSFC fund (U2013210, 62006060, 62176077), in part by the Guangdong Basic and Applied Basic Research Foundation under Grant (2019Bl515120055, 2022A1515010306), in part by the Shenzhen Key Technical Project under Grant 2020N046, in part by the Shenzhen Fundamental Research Fund under Grant (JCYJ20210324132210025), in part by the Shenzhen Stable Support Plan Fund for Universities (GXWD20201230155427003-20200824125730001), in part by the Medical Biometrics Perception and Analysis Engineering Laboratory, Shenzhen, China, and in part by Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies (2022B1212010005).

Author information

Corresponding author

Correspondence to Guangming Lu.

Additional information

Communicated by Karteek Alahari.



Cite this article

Pei, W., Feng, X., Fu, C. et al. Learning Sequence Representations by Non-local Recurrent Neural Memory. Int J Comput Vis 130, 2532–2552 (2022). https://doi.org/10.1007/s11263-022-01648-y
