A Study on Improving ALBERT with Additive Attention for Text Classification

Zhang, Zepeng; Chen, Hua; Xiong, Jiagui; Hu, Jiayu; Ni, Wenlong

doi:10.1007/978-3-031-47637-2_15

Zepeng Zhang¹³,
Hua Chen¹³,
Jiagui Xiong¹³,
Jiayu Hu¹³ &
…
Wenlong Ni¹³

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14407))

Included in the following conference series:

Asian Conference on Pattern Recognition

311 Accesses

Abstract

The Transformer has made significant advances in various fields, but high computational costs and lengthy training times pose challenges for models based on this architecture. To address this issue, we propose an improved ALBERT-based model, which replaces ALBERT’s self-attention mechanism with an additive attention mechanism. This modification can reduce computational complexity and enhance the model’s flexibility. We compare our proposed model with other Transformer-based models, demonstrating that it achieves a lower parameter count and significantly reduces computational complexity. Through extensive evaluations on diverse datasets, we establish the superior efficiency of our proposed model over alternative ones. With its reduced parameter count, our proposed model emerges as a promising approach to enhance the efficiency and practicality of Transformer-based models. Notably, it enables practical training under resource and time limitations, highlighting its adaptability and versatility in real-world scenarios.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Google Scholar
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Google Scholar
Brown, T.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Google Scholar
Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
Dosovitskiy, A., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Carion, Nicolas, Massa, Francisco, Synnaeve, Gabriel, Usunier, Nicolas, Kirillov, Alexander, Zagoruyko, Sergey: End-to-end object detection with transformers. In: Vedaldi, Andrea, Bischof, Horst, Brox, Thomas, Frahm, Jan-Michael. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chapter Google Scholar
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451 (2020)
Zhao, G., Lin, J., Zhang, Z., Ren, X., Sun, X.: Sparse transformer: concentrated attention through explicit selection (2019)
Google Scholar
Zaheer, M., et al.: Big bird: transformers for longer sequences. Adv. Neural. Inf. Process. Syst. 33, 17283–17297 (2020)
Google Scholar
Wu, C., Wu, F., Qi, T., Huang, Y., Xie, X.: Fastformer: additive attention can be all you need. arXiv preprint arXiv:2108.09084 (2021)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, October 2014
Google Scholar
Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150 (2011)
Google Scholar
Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12) (2009)
Google Scholar

Download references

Author information

Authors and Affiliations

Jiangxi Normal University, Nanchang, 330022, Jiangxi, China
Zepeng Zhang, Hua Chen, Jiagui Xiong, Jiayu Hu & Wenlong Ni

Authors

Zepeng Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Hua Chen
View author publications
You can also search for this author in PubMed Google Scholar
Jiagui Xiong
View author publications
You can also search for this author in PubMed Google Scholar
Jiayu Hu
View author publications
You can also search for this author in PubMed Google Scholar
Wenlong Ni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hua Chen .

Editor information

Editors and Affiliations

Kyushu Institute of Technology, Kitakyushu, Fukuoka, Japan
Huimin Lu
The University of Sydney, Sydney, NSW, Australia
Michael Blumenstein
Yonsei University, Seoul, Korea (Republic of)
Sung-Bae Cho
Chinese Academy of Sciences, Bejing, China
Cheng-Lin Liu
Osaka University, Osaka, Ibaraki, Japan
Yasushi Yagi
Kyushu Institute of Technology, Kitakyushu, Japan
Tohru Kamiya

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhang, Z., Chen, H., Xiong, J., Hu, J., Ni, W. (2023). A Study on Improving ALBERT with Additive Attention for Text Classification. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14407. Springer, Cham. https://doi.org/10.1007/978-3-031-47637-2_15

Download citation

DOI: https://doi.org/10.1007/978-3-031-47637-2_15
Published: 05 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47636-5
Online ISBN: 978-3-031-47637-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics