Backpropagation Computation for Training Graph Attention Networks

Journal of Signal Processing Systems

Abstract

Graph Neural Networks (GNNs) are a class of deep learning models that have found use in a variety of problems, including the modeling of drug interactions, time-series analysis, and traffic prediction. They represent the problem using non-Euclidean graphs, allowing for a high degree of versatility, and are able to learn complex relationships by iteratively aggregating contextual information from increasingly distant neighbors. Inspired by the power of attention in transformers, Graph Attention Networks (GATs) incorporate an attention mechanism on top of graph aggregation and are considered the state of the art due to their superior performance. To learn the best parameters for a given graph problem, GATs use traditional backpropagation to compute weight updates. To the best of our knowledge, these updates are calculated in software, and closed-form equations describing their calculation for GATs are not well known. This paper derives closed-form equations for backpropagation in GATs using matrix notation. These equations can form the basis for the design of hardware accelerators for training GATs.
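For context, the per-node GATv1 layer of Veličković et al. (2018), stated here in its standard form rather than in the matrix notation developed in the paper, computes

\[
e_{ij} = \mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\!\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad
\mathbf{h}_i^{\prime} = \sigma\!\Bigl(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\,\mathbf{W}\mathbf{h}_j\Bigr),
\]

where \(\mathcal{N}(i)\) is the neighborhood of node \(i\), \(\mathbf{W}\) is the shared weight matrix, and \(\mathbf{a}\) is the attention vector. Training therefore requires propagating the loss gradient through the attention-weighted sum, the softmax, and the LeakyReLU to reach \(\mathbf{W}\) and \(\mathbf{a}\); these are the gradients for which the paper derives closed-form matrix expressions.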

Data Availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

Acknowledgements

The authors thank Nanda Unnikrishnan for numerous useful discussions.

Funding

This paper was supported in part by the National Science Foundation under grant number CCF-1954749.

Author information

Corresponding authors

Correspondence to Joe Gould or Keshab K. Parhi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix. Supplementary Figures

Figure 4

Forward pass of a GAT layer on an 8-node graph, with a 5-feature-wide input and a 4-feature-wide attention-head output. Differences between GATv1 and GATv2 are separated by a vertical line, such that GATv1’s operations appear on the left and GATv2’s on the right. The matrices are colored by shape according to the legends in Fig. 2. Subfigure a shows the input to the layer, with each node having an associated feature vector. Subfigure b shows the combination and edge-coefficient calculation steps, corresponding to Eqs. (2)–(5), with each node having an associated source and destination coefficient. Subfigure c shows the pre-softmax attention coefficient calculation from Eqs. (8) and (9). Subfigure d shows the row-wise softmax operation on the matrix of these coefficients, from Eq. (10). Finally, subfigure e shows the aggregation step, weighted with the attention coefficients from Eqs. (11) and (12) (for brevity, we omit the near-identical GATv2 equation, which uses \(\mathbf {X^{k,l}_\text {src}}\) instead of \(\mathbf {X^{k,l}}\)). The figure shows only a single attention head; the output of subfigure e would be concatenated with the outputs of the other attention heads.
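The five steps in subfigures a–e can be expressed compactly in code. The following is a minimal NumPy sketch of a single GATv1 attention head, written to mirror the caption above; the dense boolean adjacency matrix (assumed to include self-loops), the variable names W, a_src, and a_dst, and the ELU output activation are illustrative assumptions and do not reproduce the paper's matrix notation or equation numbering. GATv2 would differ in subfigures b–c, applying the LeakyReLU before the dot product with the attention vector.

    import numpy as np

    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)

    def gatv1_head_forward(X, adj, W, a_src, a_dst):
        # X:            (n, f_in)     input node features                 -- subfigure a
        # adj:          (n, n)        boolean adjacency; every node is
        #                             assumed to have at least one neighbor
        # W:            (f_in, f_out) shared linear weights
        # a_src, a_dst: (f_out,)      attention vectors
        H = X @ W                                        # linear combination             -- subfigure b
        c_src = H @ a_src                                # per-node source coefficient
        c_dst = H @ a_dst                                # per-node destination coefficient
        E = leaky_relu(c_src[:, None] + c_dst[None, :])  # pre-softmax coefficients       -- subfigure c
        E = np.where(adj, E, -np.inf)                    # keep only existing edges
        E = E - E.max(axis=1, keepdims=True)             # stabilize the softmax
        expE = np.exp(E)
        A = expE / expE.sum(axis=1, keepdims=True)       # row-wise softmax               -- subfigure d
        Z = A @ H                                        # attention-weighted aggregation -- subfigure e
        out = np.where(Z > 0, Z, np.exp(Z) - 1)          # ELU output activation (illustrative choice)
        cache = (X, adj, W, a_src, a_dst, H, c_src, c_dst, A)
        return out, cache

Each additional attention head repeats this computation with its own W, a_src, and a_dst, and the head outputs are concatenated along the feature dimension, as noted at the end of the caption.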

Figure 5

Backward pass for the forward pass example in Fig. 4. Differences between GATv1 and GATv2 are separated with a vertical line, such that GATv1’s operations appear on the left and GATv2’s on the right, except for the final step in weight and feature gradient calculation. The matrices are colored by shape according to the legends in Fig. 3. Subfigure a shows the calculation of the gradients with respect to the input of the layer activation function from Eq. (15). Subfigure b shows the gradient with respect to the aggregation step due to the attention coefficients, corresponding to Eq. (19). Subfigure c follows the gradient to the input of the softmax, covering Eq. (21). Subfigure d shows the computation of the gradient with respect to attention weights from Eqs. (26) and (30). In subfigure e, the \(\mathbf {{C^\prime }^{k,l}}\) matrix is shown following Eqs. (23) and (27) and used to find \(\mathbf {\Delta ^{k,l}}\) as in Eqs. (24) and (28). Subfigure f shows the calculation of the \(\mathbf {\Sigma ^{k,l}}\) matrices following Eqs. (25) and (29), and the multiplication with the attention weights in GATv1’s case. Finally, the gradient with respect to the weights and input features is shown for GATv1 in subfigure g from Eqs. (32) and (35) and GATv2 in subfigure h from Eqs. (32) and (39).
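As a companion to the caption above, the following NumPy sketch propagates an upstream gradient dZ (the gradient of the loss with respect to the aggregation output Z = A @ H, i.e., after the activation-gradient step of subfigure a) back through the GATv1 head sketched after Fig. 4. It is a hand-derived illustration under the same assumptions as that sketch; the correspondence to the subfigures is approximate, and the code does not reproduce the paper's intermediate matrices such as \(\mathbf {\Delta ^{k,l}}\) or \(\mathbf {\Sigma ^{k,l}}\).

    # Continues the forward sketch after Fig. 4 (NumPy imported there).
    def gatv1_head_backward(dZ, cache):
        # dZ: (n, f_out) gradient of the loss w.r.t. Z = A @ H
        X, adj, W, a_src, a_dst, H, c_src, c_dst, A = cache

        # Aggregation Z = A @ H: split the gradient between A and H.           -- cf. subfigure b
        dA = dZ @ H.T
        dH = A.T @ dZ

        # Row-wise softmax backward: dE_ij = A_ij * (dA_ij - sum_k A_ik*dA_ik). -- cf. subfigure c
        dE = A * (dA - (A * dA).sum(axis=1, keepdims=True))

        # LeakyReLU backward on the pre-softmax scores (slope 0.2).
        S = c_src[:, None] + c_dst[None, :]
        dS = dE * np.where(S > 0, 1.0, 0.2)
        dS = np.where(adj, dS, 0.0)   # non-edges carry no gradient (A is already zero there)

        # S_ij = c_src_i + c_dst_j: reduce over the matching axis.
        dc_src = dS.sum(axis=1)
        dc_dst = dS.sum(axis=0)

        # c_src = H @ a_src and c_dst = H @ a_dst: attention-weight gradients.  -- cf. subfigure d
        da_src = H.T @ dc_src
        da_dst = H.T @ dc_dst
        dH += np.outer(dc_src, a_src) + np.outer(dc_dst, a_dst)

        # H = X @ W: weight and input-feature gradients.                        -- cf. subfigure g
        dW = X.T @ dH
        dX = dH @ W.T
        return dX, dW, da_src, da_dst

A convenient sanity check for hand-derived gradients like these is to compare them against finite differences or an automatic-differentiation framework on a small random graph.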

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gould, J., Parhi, K.K. Backpropagation Computation for Training Graph Attention Networks. J Sign Process Syst 96, 1–14 (2024). https://doi.org/10.1007/s11265-023-01897-1
