
ActiveGLAE: A Benchmark for Deep Active Learning with Transformers

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

Deep active learning (DAL) seeks to reduce annotation costs by enabling the model to actively query instance annotations from which it expects to learn the most. Despite extensive research, there is currently no standardized evaluation protocol for transformer-based language models in the field of DAL. Diverse experimental settings lead to difficulties in comparing research and deriving recommendations for practitioners. To tackle this challenge, we propose the ActiveGLAE benchmark, a comprehensive collection of data sets and evaluation guidelines for assessing DAL. Our benchmark aims to facilitate and streamline the evaluation process of novel DAL strategies. Additionally, we provide an extensive overview of current practice in DAL with transformer-based language models. We identify three key challenges - data set selection, model training, and DAL settings - that pose difficulties in comparing query strategies. We establish baseline results through an extensive set of experiments as a reference point for evaluating future work. Based on our findings, we provide guidelines for researchers and practitioners.
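
The benchmark targets pool-based deep active learning, where a transformer classifier is repeatedly fine-tuned on a growing labeled set and a query strategy selects the next batch of unlabeled instances to annotate. The sketch below illustrates that loop with entropy-based uncertainty sampling; it is not the paper's reference implementation, and the backbone name, query size, and training hyperparameters are assumptions chosen for illustration.

```python
# Minimal pool-based DAL loop for a transformer text classifier (illustrative
# sketch, not the ActiveGLAE reference code). Backbone, query size, and
# hyperparameters below are assumptions.
import numpy as np
import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # assumed backbone
QUERY_SIZE = 50                   # instances annotated per DAL cycle (assumed)
NUM_CYCLES = 5

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

# SST-2 serves as an example text-classification pool; its labels are known,
# so "querying" simply reveals them (simulating an oracle annotator).
pool = load_dataset("glue", "sst2")["train"].map(tokenize, batched=True)
rng = np.random.default_rng(seed=0)
labeled_idx = rng.choice(len(pool), QUERY_SIZE, replace=False).tolist()  # random seed set

for cycle in range(NUM_CYCLES):
    # Re-initialize and fine-tune on the current labeled set in every cycle
    # (cold-starting, as commonly done in the DAL literature).
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    args = TrainingArguments(output_dir="dal_out", num_train_epochs=3,
                             per_device_train_batch_size=16, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=pool.select(labeled_idx))
    trainer.train()

    # Query step: score the unlabeled pool by predictive entropy, take the top-k.
    unlabeled_idx = np.setdiff1d(np.arange(len(pool)), labeled_idx)
    logits = trainer.predict(pool.select(unlabeled_idx)).predictions
    probs = torch.softmax(torch.tensor(logits), dim=-1).numpy()
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    queried = unlabeled_idx[np.argsort(-entropy)[:QUERY_SIZE]]
    labeled_idx.extend(int(i) for i in queried)  # an oracle would provide labels here
```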


Notes

  1. Github repository.
  2. Weights and Biases project.
  3. The appendix can be accessed at ArXiv.

References

  1. Ash, J.T., Adams, R.P.: On warm-starting neural network training. CoRR (2020). https://doi.org/10.48550/arXiv.1910.08475

  2. Aßenmacher, M., Heumann, C.: On the comparability of pre-trained language models. In: Proceedings of the 5th Swiss Text Analytics Conference and 16th Conference on Natural Language Processing. CEUR Workshop Proceedings, Zurich, Switzerland, June 2020 (2020, Online). http://ceur-ws.org/Vol-2624/paper2.pdf

  3. Beck, N., Sivasubramanian, D., Dani, A., Ramakrishnan, G., Iyer, R.: Effective evaluation of deep active learning on image classification tasks (2021). https://doi.org/10.48550/arXiv.2106.15324

  4. Biewald, L.: Experiment tracking with weights and biases (2020). Software available from wandb.com. https://www.wandb.com/

  5. Brodersen, K.H., Ong, C.S., Stephan, K.E., Buhmann, J.M.: The balanced accuracy and its posterior distribution. In: 2010 20th International Conference on Pattern Recognition, August 2010, pp. 3121–3124 (2010). https://doi.org/10.1109/ICPR.2010.764

  6. Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  7. Casanueva, I., Temčinas, T., Gerz, D., Henderson, M., Vulić, I.: Efficient intent detection with dual sentence encoders. In: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, July 2020, pp. 38–45. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.nlp4convai-1.5. https://aclanthology.org/2020.nlp4convai-1.5

  8. D’Arcy, M., Downey, D.: Limitations of active learning with deep transformer language models (2022). https://openreview.net/forum?id=Q8OjAGkxwP5

  9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423

  10. Ein-Dor, L., et al.: Active learning for BERT: an empirical study. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7949–7962. Association for Computational Linguistics (2020, Online). https://doi.org/10.18653/v1/2020.emnlp-main.638

  11. Ferreira, W., Vlachos, A.: Emergent: a novel data-set for stance classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, June 2016, pp. 1163–1168. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1138. https://aclanthology.org/N16-1138

  12. Gao, L., et al.: The pile: an 800 GB dataset of diverse text for language modeling. CoRR (2020). https://doi.org/10.48550/arXiv.2101.00027

  13. Gao, T., Fisch, A., Chen, D.: Making pre-trained language models better few-shot learners. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, August 2021, pp. 3816–3830. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.295. https://aclanthology.org/2021.acl-long.295

  14. Gonsior, J., Falkenberg, C., Magino, S., Reusch, A., Thiele, M., Lehner, W.: To Softmax, or not to Softmax: that is the question when applying active learning for transformer models. CoRR (2022). https://doi.org/10.48550/arXiv.2210.03005

  15. Hacohen, G., Dekel, A., Weinshall, D.: Active learning on a budget: opposite strategies suit high and low budgets (2022). https://doi.org/10.48550/arXiv.2202.02794

  16. Herde, M., Huseljic, D., Sick, B., Calma, A.: A survey on cost types, interaction schemes, and annotator performance models in selection algorithms for active learning in classification. CoRR (2021). https://doi.org/10.48550/arXiv.2109.11301

  17. Hu, P., Lipton, Z.C., Anandkumar, A., Ramanan, D.: Active learning with partial feedback. CoRR (2019). https://doi.org/10.48550/arXiv.1802.07427

  18. Ji, Y., Kaestner, D., Wirth, O., Wressnegger, C.: Randomness is the root of all evil: more reliable evaluation of deep active learning. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, pp. 3932–3941. IEEE (2023). https://doi.org/10.1109/WACV56688.2023.00393

  19. Jukić, J., Šnajder, J.: Smooth sailing: improving active learning for pre-trained language models with representation smoothness analysis (2022). https://doi.org/10.48550/arXiv.2212.11680

  20. Kottke, D., Calma, A., Huseljic, D., Krempl, G., Sick, B., et al.: Challenges of reliable, realistic and comparable active learning evaluation. In: Proceedings of the Workshop and Tutorial on Interactive Adaptive Learning, pp. 2–14 (2017)

  21. Kwak, B., Kim, Y., Kim, Y.J., Hwang, S., Yeo, J.: TrustAL: trustworthy active learning using knowledge distillation (2022). https://doi.org/10.48550/arXiv.2201.11661

  22. Lang, A., Mayer, C., Timofte, R.: Best practices in pool-based active learning for image classification (2022). https://openreview.net/forum?id=7Rnf1F7rQhR

  23. Lehmann, J., et al.: DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 6(2), 167–195 (2015). https://doi.org/10.3233/SW-140134

  24. Lhoest, Q., et al.: Datasets: a community library for natural language processing. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online and Punta Cana, Dominican Republic, pp. 175–184. Association for Computational Linguistics (2021). https://aclanthology.org/2021.emnlp-demo.21

  25. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, vol. 1, pp. 1–7. Association for Computational Linguistics (2002). https://doi.org/10.3115/1072228.1072378

  26. Li, Y., Chen, M., Liu, Y., He, D., Xu, Q.: An empirical study on the efficacy of deep active learning for image classification (November 2022)

  27. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR (2019). https://doi.org/10.48550/arXiv.1907.11692

  28. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. CoRR (2017). https://doi.org/10.48550/arXiv.1711.05101

  29. Lu, J., MacNamee, B.: Investigating the effectiveness of representations based on pretrained transformer-based language models in active learning for labelling text datasets (2020). https://doi.org/10.48550/arXiv.2004.13138

  30. Lüth, C.T., Bungert, T.J., Klein, L., Jaeger, P.F.: Toward realistic evaluation of deep active learning algorithms in image classification. CoRR (2023). https://doi.org/10.48550/arXiv.2301.10625

  31. Margatina, K., Barrault, L., Aletras, N.: Bayesian active learning with pretrained language models. CoRR (2021). https://doi.org/10.48550/arXiv.2104.08320

  32. Margatina, K., Barrault, L., Aletras, N.: On the importance of effectively adapting pretrained language models for active learning. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, May 2022, pp. 825–836. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.acl-short.93. https://aclanthology.org/2022.acl-short.93

  33. Margatina, K., Vernikos, G., Barrault, L., Aletras, N.: Active learning by acquiring contrastive examples. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 650–663. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.51

  34. Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines (2021). https://doi.org/10.48550/arXiv.2006.04884

  35. Munjal, P., Hayat, N., Hayat, M., Sourati, J., Khan, S.: Towards robust and reproducible active learning using neural networks (2022). https://doi.org/10.48550/arXiv.2002.09564

  36. OpenAI: ChatGPT: optimizing language models for dialogue (2022). https://openai.com/blog/chatgpt/. Accessed 10 Jan 2023

  37. Perez, E., Kiela, D., Cho, K.: True few-shot learning with language models. CoRR (2021). https://doi.org/10.48550/arXiv.2105.11447

  38. Pomerleau, D., Rao, D.: Fake news challenge (2017). http://www.fakenewschallenge.org/

  39. Prabhu, S., Mohamed, M., Misra, H.: Multi-class text classification using BERT-based active learning. CoRR (2021). https://doi.org/10.48550/arXiv.2104.14289

  40. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAi Blog (2019)

  41. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 5485–5551 (2020)

  42. Rauch, L., Huseljic, D., Sick, B.: Enhancing active learning with weak supervision and transfer learning by leveraging information and knowledge sources. In: IAL@PKDD/ECML (2022)

  43. Ren, P., Xiao, Y., Chang, X., Huang, P.Y., Li, Z., Gupta, B.B., Chen, X., Wang, X.: A survey of deep active learning. ACM Comput. Surv. 54(9) (2021). https://doi.org/10.1145/3472291

  44. Ren, P., et al.: A survey of deep active learning. ACM Comput. Surv. 54(9) (2021). https://doi.org/10.1145/3472291

  45. Ru, D., et al.: Active sentence learning by adversarial uncertainty sampling in discrete space. In: Findings of the Association for Computational Linguistics, EMNLP 2020, Online, November 2020, pp. 4908–4917. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.441. https://aclanthology.org/2020.findings-emnlp.441

  46. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR (2020). https://doi.org/10.48550/arXiv.1910.01108

  47. Schick, T., Schütze, H.: Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269. Association for Computational Linguistics (2021, Online). https://doi.org/10.18653/v1/2021.eacl-main.20. https://aclanthology.org/2021.eacl-main.20

  48. Schick, T., Schütze, H.: It’s not just size that matters: small language models are also few-shot learners. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2339–2352. Association for Computational Linguistics (2021, Online). https://doi.org/10.18653/v1/2021.naacl-main.185. https://aclanthology.org/2021.naacl-main.185

  49. Schröder, C., Niekler, A., Potthast, M.: Revisiting uncertainty-based query strategies for active learning with transformers. In: Findings of the Association for Computational Linguistics, ACL 2022, Dublin, Ireland, pp. 2194–2203. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.findings-acl.172. https://aclanthology.org/2022.findings-acl.172

  50. Seo, S., Kim, D., Ahn, Y., Lee, K.H.: Active learning on pre-trained language model with task-independent triplet loss. Proc. AAAI Conf. Artif. Intell. 36(10), 11276–11284 (2022). https://doi.org/10.1609/aaai.v36i10.21378

  51. Settles, B.: Active learning literature survey. Computer Sciences, Technical report, 1648, University of Wisconsin-Madison (2010)

  52. Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, October 2013, pp. 1631–1642. Association for Computational Linguistics (2013). https://aclanthology.org/D13-1170

  53. Tan, W., Du, L., Buntine, W.: Diversity enhanced active learning with strictly proper scoring rules. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 34 (2021)

  54. Tran, D., et al.: Plex: towards reliability using pretrained large model extensions. CoRR (2022). https://doi.org/10.48550/arXiv.2207.07411

  55. Wang, A., et al.: SuperGLUE: a stickier benchmark for general-purpose language understanding systems. CoRR (2020). https://doi.org/10.48550/arXiv.1905.00537

  56. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. CoRR (2018). https://doi.org/10.48550/arXiv.1804.07461

  57. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018, pp. 1112–1122. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/N18-1101. https://aclanthology.org/N18-1101

  58. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6. https://aclanthology.org/2020.emnlp-demos.6

  59. Wulczyn, E., Thain, N., Dixon, L.: Ex Machina: personal attacks seen at scale. In: Proceedings of the 26th International Conference on World Wide Web, Republic and Canton of Geneva, CHE, pp. 1391–1399. International World Wide Web Conferences Steering Committee (2017). https://doi.org/10.1145/3038912.3052591

  60. Yi, J.S.K., Seo, M., Park, J., Choi, D.G.: PT4AL: using self-supervised pretext tasks for active learning, July 2022

  61. Yu, Y., Kong, L., Zhang, J., Zhang, R., Zhang, C.: AcTune: uncertainty-based active self-training for active fine-tuning of pretrained language models. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, pp. 1422–1436. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.naacl-main.102

  62. Yu, Y., Zhang, R., Xu, R., Zhang, J., Shen, J., Zhang, C.: Cold-start data selection for few-shot language model fine-tuning: a prompt-based uncertainty propagation approach (2022). https://doi.org/10.48550/arXiv.2209.06995

  63. Yuan, M., Lin, H.T., Boyd-Graber, J.: Cold-start active learning through self-supervised language modeling. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7935–7948. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.637

  64. Zha, D., Bhat, Z.P., Lai, K.H., Yang, F., Hu, X.: Data-centric AI: perspectives and Challenges (2023). https://doi.org/10.48550/arXiv.2301.04819

  65. Zhang, S., Gong, C., Liu, X., He, P., Chen, W., Zhou, M.: ALLSH: active learning guided by local sensitivity and hardness. In: Findings of the Association for Computational Linguistics, NAACL 2022, Seattle, United States, July 2022, pp. 1328–1342. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.findings-naacl.99. https://aclanthology.org/2022.findings-naacl.99

  66. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015)

Acknowledgments

This work received partial funding from WIBank Hessen (EFRE) for the project INFINA and from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as part of BERD@NFDI (grant number 460037581).

Author information

Corresponding author

Correspondence to Lukas Rauch.

Ethics declarations

Ethics Statement

Limitations. This work represents a snapshot of the current practice of applying DAL to the NLP domain. This comes with the limitation that we do not go into detail about other fields, such as computer vision or speech processing. Especially in the former, an active research community is working on evaluating DAL [3, 18, 22, 26, 30], and the differences in modalities make a comparison across fields highly intriguing. Further, the experimental outcome of our work is not exhaustive: we tested a limited number of models and query strategies on a set of data sets that we consider representative, while controlling for as much exogenous influence as possible. This should be seen as a blueprint for the experimental setup rather than a definitive statement about the state of the art.

Ethical Considerations. To the best of our knowledge, no ethical concerns are raised by our work. Only two aspects are affected in a broader sense. The first is the environmental impact of the computationally expensive experiments required to evaluate deep active learning (DAL) strategies; given ever-increasing model sizes and the already controversial debate around this topic, this is a crucial aspect to consider. The second is the substitution of human labeling labor by DAL. Especially when it comes to labeling toxic or explicit content, suitable DAL strategies might be one way to limit human exposure to such data.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rauch, L., Aßenmacher, M., Huseljic, D., Wirth, M., Bischl, B., Sick, B. (2023). ActiveGLAE: A Benchmark for Deep Active Learning with Transformers. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14169. Springer, Cham. https://doi.org/10.1007/978-3-031-43412-9_4

  • DOI: https://doi.org/10.1007/978-3-031-43412-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43411-2

  • Online ISBN: 978-3-031-43412-9

  • eBook Packages: Computer Science, Computer Science (R0)
