Abstract
The Transformer has become the dominant modeling paradigm in deep learning, and multi-head attention is one of its critical components. While multi-head attention improves the Transformer's effectiveness, it also has drawbacks. Once the number of heads exceeds a certain point, some attention heads produce remarkably similar attention maps, indicating that these heads perform redundant computation. Other heads may even attend to irrelevant content, degrading the final result. After analyzing the multi-head attention mechanism, this paper argues that the uniformity of the inputs to the multi-head attention mechanism is the underlying reason for the similarity of attention maps across heads. For this reason, this paper proposes classifying the heads in the multi-head attention mechanism and summarizes a general classification procedure. Three classification schemes are designed for the Multi30k dataset. Experiments demonstrate that our method converges faster than the baseline model and that BLEU improves by 3.08–4.38 over the baseline.
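To make the contrast concrete, the following is a minimal NumPy sketch: in standard multi-head attention every head projects the same input X, whereas a hypothetical "classified" variant splits the heads into classes and feeds each class a different view of the input. The specific view-splitting scheme (slicing X by feature blocks) is an illustrative assumption and is not the paper's actual classification method.

```python
# Sketch: standard multi-head attention (all heads share input X) versus a
# hypothetical head-classification variant (head classes see different input
# views). The feature-slice grouping below is an assumption for illustration.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v, scores

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))

# Standard multi-head attention: every head projects the same input X,
# which is what the paper identifies as the source of similar attention maps.
Wq = rng.normal(size=(n_heads, d_model, d_head))
Wk = rng.normal(size=(n_heads, d_model, d_head))
Wv = rng.normal(size=(n_heads, d_model, d_head))
standard_maps = [attention(X @ Wq[h], X @ Wk[h], X @ Wv[h])[1]
                 for h in range(n_heads)]

# Hypothetical classified variant: heads are split into two classes and each
# class attends over a different slice (view) of the input features.
views = [X[:, :d_model // 2], X[:, d_model // 2:]]
Wq2 = rng.normal(size=(n_heads, d_model // 2, d_head))
Wk2 = rng.normal(size=(n_heads, d_model // 2, d_head))
Wv2 = rng.normal(size=(n_heads, d_model // 2, d_head))
classified_maps = [attention(views[h % 2] @ Wq2[h],
                             views[h % 2] @ Wk2[h],
                             views[h % 2] @ Wv2[h])[1]
                   for h in range(n_heads)]
```

Comparing `standard_maps` with `classified_maps` (e.g., by pairwise distances between attention matrices) illustrates the kind of redundancy analysis the abstract describes; the actual schemes evaluated on Multi30k are defined in the paper itself.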