Abstract
Building a language model from freely available internet content involves several steps and poses several challenges. The model presented here is a BERT-based language model for European Portuguese, with no specific domain restriction. The corpus was built using the web-archive infrastructure provided by Arquivo.pt and was restricted to .pt domains. This paper describes the overall process of building the corpus and training a BERT model.
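To make the corpus-building step concrete, the following is a minimal sketch (not the authors' pipeline) of how archived pages from .pt domains can be retrieved through Arquivo.pt's public TextSearch API. The endpoint and JSON field names follow the public API documentation as commonly described; the bare ".pt" value for siteSearch is an assumption used only to illustrate the domain restriction.

```python
import requests

ARQUIVO_TEXTSEARCH = "https://arquivo.pt/textsearch"  # public full-text search API

def fetch_pt_page_texts(query, max_items=10):
    """Yield plain-text renderings of archived pages matching `query`.

    Field names ("response_items", "linkToExtractedText") follow the
    public TextSearch API documentation; verify against the current docs.
    """
    params = {
        "q": query,
        "siteSearch": ".pt",   # restrict results to .pt domains (assumption)
        "maxItems": max_items,
    }
    response = requests.get(ARQUIVO_TEXTSEARCH, params=params, timeout=30)
    response.raise_for_status()
    for item in response.json().get("response_items", []):
        text_url = item.get("linkToExtractedText")
        if text_url:
            # The extracted-text link serves the page body without HTML markup,
            # convenient raw material for a pre-training corpus.
            yield requests.get(text_url, timeout=30).text

if __name__ == "__main__":
    for text in fetch_pt_page_texts("notícias", max_items=3):
        print(text[:200], "...")
```

Each document retrieved this way would still need cleaning and deduplication before it could serve as pre-training text.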
Cite this paper
Miquelina, N., Quaresma, P., Nogueira, V.B. (2022). Generating a European Portuguese BERT Based Model Using Content from Arquivo.pt Archive. In: Yin, H., Camacho, D., Tino, P. (eds.) Intelligent Data Engineering and Automated Learning – IDEAL 2022. Lecture Notes in Computer Science, vol. 13756. Springer, Cham. https://doi.org/10.1007/978-3-031-21753-1_28