Temporal attention augmented transformer Hawkes process

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

In recent years, mining knowledge from asynchronous event sequences with Hawkes processes has attracted continued attention, and Hawkes processes based on neural networks, especially the recurrent neural network (RNN), have become one of the most actively researched directions. However, these models inherit some shortcomings of RNNs, such as vanishing and exploding gradients and difficulty capturing long-term dependencies. Meanwhile, the Transformer, built on self-attention, has achieved great success in sequential modeling tasks such as text processing and speech recognition. Although the Transformer Hawkes process (THP) has delivered large performance gains, it does not effectively exploit the temporal information in asynchronous events: in such sequences the instants at which events occur are as important as the event types, yet conventional THPs simply convert temporal information into positional encodings and add them to the Transformer input. With this in mind, we propose a new Transformer-based Hawkes process model, the temporal attention augmented Transformer Hawkes process (TAA-THP), which modifies the traditional dot-product attention structure and introduces the temporal encoding directly into the attention mechanism. We conduct extensive experiments on a wide range of synthetic and real-life datasets to validate the performance of the proposed TAA-THP model, and we achieve significant improvements over existing baseline models on several measures, including log-likelihood on the test dataset and prediction accuracy of event types and occurrence times. In addition, ablation studies demonstrate the merit of the additional temporal attention by comparing the model's performance with and without it.
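To make the idea concrete, the following is a minimal sketch, not the authors' implementation, of how scaled dot-product attention can be augmented with a separate temporal term, in the spirit described above. All names here (sinusoidal_temporal_encoding, W_q, W_k, W_v, W_t) are illustrative assumptions, and d_model is assumed to be even.

```python
# Illustrative sketch (assumption, not the TAA-THP code): attention scores that
# combine the usual content term q·k with a term driven by temporal encodings.
import math
import torch
import torch.nn as nn


def sinusoidal_temporal_encoding(timestamps: torch.Tensor, d_model: int) -> torch.Tensor:
    """Encode real-valued event times of shape (batch, seq_len) into
    (batch, seq_len, d_model) with sinusoids, analogous to positional
    encoding but driven by the timestamps themselves."""
    i = torch.arange(0, d_model, 2, dtype=torch.float32, device=timestamps.device)
    div = torch.exp(-math.log(10000.0) * i / d_model)       # (d_model/2,)
    angles = timestamps.unsqueeze(-1) * div                 # (B, L, d_model/2)
    enc = torch.zeros(*timestamps.shape, d_model, device=timestamps.device)
    enc[..., 0::2] = torch.sin(angles)
    enc[..., 1::2] = torch.cos(angles)
    return enc


class TemporalAugmentedAttention(nn.Module):
    """Single-head attention whose scores are the sum of a content-based
    term and a temporal term computed from the event timestamps."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_t = nn.Linear(d_model, d_model, bias=False)  # projects temporal encodings

    def forward(self, x: torch.Tensor, timestamps: torch.Tensor,
                mask: torch.Tensor = None) -> torch.Tensor:
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        t = self.W_t(sinusoidal_temporal_encoding(timestamps, self.d_model))
        # Content scores plus temporal scores, jointly scaled before softmax.
        scores = (q @ k.transpose(-2, -1) + q @ t.transpose(-2, -1)) / math.sqrt(self.d_model)
        if mask is not None:
            # e.g. a causal mask (True = blocked) so events attend only to the past.
            scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v
```

A multi-head version, residual connections, and the intensity-function decoder used by THP-style models would sit on top of such a block; the point of the sketch is only that event timestamps enter the attention scores themselves rather than being added to the input embeddings.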

Funding

This work was supported by the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ023). We thank Hong-Yuan Mei and Si-Miao Zuo for their generous help during our research; their assistance greatly improved this work.

Author information

Corresponding author

Correspondence to Jian-wei Liu.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Temporal Attention Augmented Transformer Hawkes Process."

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, L.-N., Liu, J.-W., Song, Z.-Y. et al. Temporal attention augmented transformer Hawkes process. Neural Computing and Applications 34, 3795–3809 (2022). https://doi.org/10.1007/s00521-021-06641-z
