
How BERT’s Dropout Fine-Tuning Affects Text Classification?


Part of the Lecture Notes in Business Information Processing book series (LNBIP,volume 416)

Abstract

Pretraining language models makes it easier to fit models to new, small datasets by retaining the knowledge acquired during pretraining. These task-agnostic models can then be fine-tuned for any NLP task. In this paper, we study the effect of fine-tuning BERT on small amounts of data for news classification and sentiment analysis. Our experiments highlight the impact of tweaking the dropout hyper-parameters on classification performance. We conclude that combining the hidden-layer and attention dropout probabilities reduces overfitting.
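The abstract's conclusion concerns tuning two separate dropout rates jointly: one applied to BERT's hidden-layer outputs and one to its attention weights. As a minimal sketch of the mechanism being tuned (plain Python, not the authors' code), standard inverted dropout zeroes each activation with probability p and rescales the survivors so the expected value is unchanged:

```python
import random

def inverted_dropout(values, p, rng=None):
    """Zero each element with probability p and scale survivors by
    1 / (1 - p), so the expected activation is unchanged at train time."""
    if not 0.0 <= p < 1.0:
        raise ValueError("dropout probability must be in [0, 1)")
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# During fine-tuning, BERT applies one rate to hidden-layer outputs and a
# (possibly different) rate to attention probabilities; both are disabled
# at inference time.
hidden = inverted_dropout([0.5] * 8, p=0.3)     # hidden-layer dropout
attention = inverted_dropout([0.25] * 8, p=0.1)  # attention dropout
```

In the Hugging Face transformers library, these two rates correspond to the `hidden_dropout_prob` and `attention_probs_dropout_prob` fields of `BertConfig`; that mapping is an assumption about the reader's toolchain, not something the paper prescribes.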

Keywords

  • BERT
  • Fine-tuning
  • Text classification
  • Dropout

Notes

  1.

    http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.


Author information


Corresponding author

Correspondence to Salma El Anigri .


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

El Anigri, S., Himmi, M.M., Mahmoudi, A. (2021). How BERT’s Dropout Fine-Tuning Affects Text Classification?. In: Fakir, M., Baslam, M., El Ayachi, R. (eds) Business Intelligence. CBI 2021. Lecture Notes in Business Information Processing, vol 416. Springer, Cham. https://doi.org/10.1007/978-3-030-76508-8_11


  • DOI: https://doi.org/10.1007/978-3-030-76508-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76507-1

  • Online ISBN: 978-3-030-76508-8

  • eBook Packages: Computer Science (R0)