SP-BERT: A Language Model for Political Text in Scandinavian Languages

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2023)

Abstract

Language models are at the core of modern Natural Language Processing. We present a new BERT-style language model dedicated to political texts in Scandinavian languages. Concretely, we introduce SP-BERT, a model trained with parliamentary speeches in Norwegian, Swedish, Danish, and Icelandic. To show its utility, we evaluate its ability to predict the speakers’ party affiliation and explore language shifts of politicians transitioning between Cabinet and Opposition.

Notes

  1. https://data.stortinget.no/om-datatjenesten/bruksvilkar/

  2. https://data.riksdagen.se/data/anforanden/

  3. https://huggingface.co/tumd/sp-bert

  4. https://huggingface.co/bert-base-multilingual-cased

  5. https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html

  6. Due to space limitations, we omit the detailed pre-processing steps.

  7. https://spacy.io
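
Note 5 points to scikit-learn's compute_class_weight utility, which suggests that class weighting was used to offset uneven party sizes in the party-affiliation task. The page does not say which setting was chosen, so the following is only a minimal sketch, assuming the "balanced" heuristic (weight = n_samples / (n_classes * class_count)). It is written in plain Python rather than with scikit-learn, and the party labels are hypothetical placeholders, not data from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class using the 'balanced' heuristic.

    Assumption: weight = n_samples / (n_classes * class_count), which is the
    formula scikit-learn documents for class_weight="balanced".
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical party labels for ten speeches (not taken from the paper).
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = balanced_class_weights(labels)

# Rarer parties receive larger weights, so errors on them cost more in training.
for party in sorted(weights):
    print(party, round(weights[party], 3))
```

With these weights each class contributes equally to a weighted loss, since the weight times the class count is the same for every class.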

Acknowledgements

This work was carried out as part of the Trondheim Analytica project and funded under the Digital Transformation program at the Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway.

Author information

Corresponding author

Correspondence to Tu My Doan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Doan, T.M., Kille, B., Gulla, J.A. (2023). SP-BERT: A Language Model for Political Text in Scandinavian Languages. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_34

  • DOI: https://doi.org/10.1007/978-3-031-35320-8_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35319-2

  • Online ISBN: 978-3-031-35320-8

  • eBook Packages: Computer Science, Computer Science (R0)
