
BHPVAS: visual analysis system for pruning attention heads in BERT model

Regular Paper · Published in the Journal of Visualization

Abstract

In the field of deep learning, pre-trained BERT models have achieved remarkable success. However, they come with increasingly complex structures and ever larger numbers of network parameters, and this huge parameter size makes computation extremely expensive in both time and memory. Recent work has shown that BERT models contain a significant number of redundant attention heads, and numerous BERT compression algorithms have been proposed that effectively reduce model complexity and redundancy by pruning some of these heads. Nevertheless, existing automated compression solutions mainly follow predetermined pruning programs, which require multiple expensive pruning-retraining cycles or heuristic designs for selecting additional hyperparameters. Furthermore, the training process of BERT models is a black box that lacks interpretability, so researchers cannot intuitively understand how the model is optimized. In this paper, we propose BHPVAS, a visual analysis system for pruning BERT models, which helps researchers incorporate their understanding of model structure and operating mechanisms into the pruning process and generate pruning schemes. We propose three pruning criteria based on attention data, namely an importance score, a stability score, and a similarity score, for evaluating the importance of self-attention heads. Additionally, we design multiple coordinated views that display the entire pruning process and guide users through pruning. Our system supports exploring the role of self-attention heads in model inference using text dependency relations and attention weight distributions. Finally, we conduct two case studies, on sentiment classification sample analysis and pruning scheme exploration, to demonstrate the system and verify its effectiveness.
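The abstract describes scoring self-attention heads from attention data and pruning the lowest-scoring heads. As a rough illustration of this kind of head scoring, the sketch below ranks heads by how concentrated their attention distributions are (negative entropy) and selects the lowest-ranked heads for pruning. The scoring rule and the function names are illustrative assumptions for exposition only, not the paper's actual importance, stability, or similarity criteria.

```python
import numpy as np

def head_importance_scores(attn, eps=1e-12):
    """Score each self-attention head by the negative mean entropy of its
    attention distributions: heads with concentrated (low-entropy)
    attention score higher. `attn` has shape (num_heads, seq_len, seq_len),
    with each row a probability distribution summing to 1.
    Illustrative stand-in for a real pruning criterion."""
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, seq)
    return -entropy.mean(axis=-1)                        # (heads,)

def select_heads_to_prune(scores, keep_ratio=0.5):
    """Return indices of the lowest-scoring heads to prune."""
    n_prune = int(len(scores) * (1 - keep_ratio))
    order = np.argsort(scores)  # ascending: least important first
    return sorted(order[:n_prune].tolist())
```

For example, a head whose rows are nearly uniform (high entropy) would be ranked below a head that attends sharply to one token, and would be pruned first under this rule.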



Acknowledgements

This work was supported by the Zhejiang Provincial Natural Science Foundation of China under Grants No. LY22F020023 and No. LZ22F020015, and by the National Natural Science Foundation of China under Grants No. 61972122 and No. U22A2033.

Author information

Corresponding author

Correspondence to Xiangyang Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below are the links to the electronic supplementary material.

Supplementary file 1 (MP4 21,542 KB)

Supplementary file 2 (TSV 3,716 KB)

Supplementary file 3 (TSV 192 KB)

Supplementary file 4 (TSV 92 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, Z., Sun, H., Sun, H. et al. BHPVAS: visual analysis system for pruning attention heads in BERT model. J Vis (2024). https://doi.org/10.1007/s12650-024-00985-z

