
Learning Sequence Representations by Non-local Recurrent Neural Memory

Published in: International Journal of Computer Vision

Abstract

The key challenge of sequence representation learning is to capture long-range temporal dependencies. Typical methods for supervised sequence representation learning are built upon recurrent neural networks (RNNs) to capture temporal dependencies. One potential limitation of these methods is that they explicitly model only one-order information interactions between adjacent time steps in a sequence, so the high-order interactions between non-adjacent time steps are not fully exploited. This greatly limits their ability to model long-range temporal dependencies, since the temporal features learned through one-order interactions cannot be maintained over a long term, owing to temporal information dilution and gradient vanishing. To tackle this limitation, we propose the non-local recurrent neural memory (NRNM) for supervised sequence representation learning, which performs non-local operations by means of a self-attention mechanism to learn full-order interactions within a sliding temporal memory block, and models global interactions between memory blocks in a gated recurrent manner. Consequently, our model is able to capture long-range dependencies. Moreover, the latent high-level features contained in high-order interactions can be distilled by our model. We validate the effectiveness and generalization of NRNM on three types of sequence applications across different modalities: sequence classification, step-wise sequential prediction, and sequence similarity learning. Our model compares favorably against state-of-the-art methods specifically designed for each of these applications.
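The abstract describes two mechanisms: self-attention applied within a sliding memory block to capture full-order interactions, and a gated recurrent update that propagates information between memory blocks. The sketch below illustrates only this high-level structure in NumPy; the function names (`nrnm_step`, `self_attention`), the identity query/key/value projections, and the simplified sigmoid gate are illustrative assumptions, not the authors' implementation (see the official code repository linked in the Notes for that).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k):
    # Single-head scaled dot-product self-attention over the rows of x.
    # Query/key/value projections are omitted (identity) for brevity.
    scores = x @ x.T / np.sqrt(d_k)
    return softmax(scores) @ x

def nrnm_step(memory, hidden_block):
    """One sliding-window memory update (illustrative only).

    memory:       (M, d) memory slots carried across blocks.
    hidden_block: (T, d) RNN hidden states inside the current window.
    Self-attention over the concatenation [memory; hidden states] lets every
    position interact with every other one (full-order interactions); a
    sigmoid gate then merges the attended summary into the old memory,
    standing in for the gated recurrent update between blocks.
    """
    d = memory.shape[1]
    joint = np.concatenate([memory, hidden_block], axis=0)
    attended = self_attention(joint, d)[: memory.shape[0]]  # keep memory slots
    gate = 1.0 / (1.0 + np.exp(-(memory + attended)))       # illustrative gate
    return gate * attended + (1.0 - gate) * memory
```

Sliding the window over a long sequence and feeding the updated memory into the next block is what lets information survive beyond the reach of one-order adjacent-step interactions.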


Notes

  1. https://github.com/F-Frida/NRNM.

  2. https://github.com/pytorch/text.



Acknowledgements

This work was supported in part by the NSFC fund (U2013210, 62006060, 62176077), in part by the Guangdong Basic and Applied Basic Research Foundation under Grant (2019Bl515120055, 2022A1515010306), in part by the Shenzhen Key Technical Project under Grant 2020N046, in part by the Shenzhen Fundamental Research Fund under Grant (JCYJ20210324132210025), in part by the Shenzhen Stable Support Plan Fund for Universities (GXWD20201230155427003-20200824125730001), in part by the Medical Biometrics Perception and Analysis Engineering Laboratory, Shenzhen, China, and in part by Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies (2022B1212010005).

Author information

Corresponding author

Correspondence to Guangming Lu.

Additional information

Communicated by Karteek Alahari.



Cite this article

Pei, W., Feng, X., Fu, C. et al. Learning Sequence Representations by Non-local Recurrent Neural Memory. Int J Comput Vis 130, 2532–2552 (2022). https://doi.org/10.1007/s11263-022-01648-y
