
Generating a European Portuguese BERT Based Model Using Content from Arquivo.pt Archive

Conference paper in: Intelligent Data Engineering and Automated Learning – IDEAL 2022 (IDEAL 2022)

Abstract

Building a language model from freely available internet content involves several steps and challenges. The model presented here aims to be a BERT-based language model for European Portuguese, not restricted to any specific domain. The corpus was built using the web-archive infrastructure provided by Arquivo.pt, restricted to .pt domains. This paper describes the overall process of building the corpus and training the BERT model.
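As a concrete illustration of the corpus-building step, the sketch below queries Arquivo.pt for archived .pt pages and downloads their extracted text. It is a minimal, hypothetical example rather than the paper's actual pipeline: the TextSearch endpoint, the siteSearch parameter, and the JSON field names (response_items, linkToExtractedText) are assumptions that should be verified against the current Arquivo.pt API documentation.

```python
# Hypothetical sketch of harvesting text from Arquivo.pt. The endpoint,
# parameters, and JSON field names below are assumptions based on the
# public TextSearch API and should be checked against its documentation.
import requests

API = "https://arquivo.pt/textsearch"

def fetch_extracted_texts(query: str, max_items: int = 50) -> list[str]:
    """Query Arquivo.pt and download the extracted text of each result."""
    params = {
        "q": query,
        "siteSearch": ".pt",   # assumed filter restricting results to .pt domains
        "maxItems": max_items,
    }
    response = requests.get(API, params=params, timeout=30)
    response.raise_for_status()
    texts = []
    for item in response.json().get("response_items", []):
        text_url = item.get("linkToExtractedText")
        if text_url:
            texts.append(requests.get(text_url, timeout=30).text)
    return texts

if __name__ == "__main__":
    for doc in fetch_extracted_texts("notícias", max_items=5):
        print(doc[:200])
```

In practice, pages harvested this way would still need boilerplate removal, language filtering, and deduplication before being usable as pretraining data.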


Notes

  1. https://arquivo.pt/.

  2. https://dados.gov.pt/pt/datasets/publicacoes-periodicas-portuguesas-jornais-e-revistas-websites-e-historico-de-versoes-no-arquivo-pt/.

  3. https://tika.apache.org/.

  4. https://huggingface.co/.

  5. https://vision.uevora.pt/.
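Footnote 4 points to Hugging Face, the standard tooling for this kind of pretraining. The sketch below shows a minimal masked-language-model setup with the transformers and datasets libraries; the tokenizer path, corpus file, and hyperparameters are illustrative placeholders, not the configuration used in the paper.

```python
# Minimal sketch of BERT-style masked-language-model pretraining with the
# Hugging Face libraries mentioned in footnote 4. "./tokenizer-pt",
# "corpus.txt", and all hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumes a WordPiece tokenizer was already trained on the corpus and saved.
tokenizer = BertTokenizerFast.from_pretrained("./tokenizer-pt")

# One document (or line) of raw corpus text per line in corpus.txt.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Train a BERT encoder from scratch on the masked-language-model objective.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# 15% of tokens are masked, as in the original BERT pretraining setup.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-pt", per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```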


Author information

Correspondence to Nuno Miquelina.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Miquelina, N., Quaresma, P., Nogueira, V.B. (2022). Generating a European Portuguese BERT Based Model Using Content from Arquivo.pt Archive. In: Yin, H., Camacho, D., Tino, P. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2022. IDEAL 2022. Lecture Notes in Computer Science, vol 13756. Springer, Cham. https://doi.org/10.1007/978-3-031-21753-1_28


  • DOI: https://doi.org/10.1007/978-3-031-21753-1_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21752-4

  • Online ISBN: 978-3-031-21753-1

  • eBook Packages: Computer Science (R0)
