Abstract
Machine Learning models are rapidly being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data are a key and basic component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework arises from the difficulty of finding inputs in languages other than English. With an approach focused on producing a high-quality resource, the construction phases were designed together with the considerations that must be taken into account at each one. The stages are characterization of the corpus to be obtained, document collection, cleaning, translation, storage, and evaluation. The proposed approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with free-of-charge libraries. Finally, a case study was developed, resulting in a Spanish corpus of more than 170,000 documents within a specific domain, namely opinions on textile products. The evaluations carried out indicate that the proposed framework can build a large-scale, high-quality corpus.
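The cleaning, translation, and storage stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Document` record, the cleaning rules, the `min_words` threshold, and the `fake_translate` stand-in are all hypothetical, and a real pipeline would pass an actual machine-translation function in place of the stub. Corpus characterization, document collection, and evaluation are left out because the abstract does not describe how they are carried out.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List


@dataclass
class Document:
    """One corpus entry (hypothetical schema, not taken from the paper)."""
    doc_id: int
    source_text: str  # cleaned English text
    target_text: str  # text returned by the translation function


def clean(text: str) -> str:
    """Remove HTML-like tags and collapse runs of whitespace (illustrative rules)."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def build_corpus(
    raw_texts: Iterable[str],
    translate: Callable[[str], str],
    min_words: int = 3,  # assumed threshold for discarding very short documents
) -> List[Document]:
    """Clean each raw text, drop documents that are too short, then translate."""
    corpus: List[Document] = []
    for raw in raw_texts:
        cleaned = clean(raw)
        if len(cleaned.split()) < min_words:
            continue
        corpus.append(Document(len(corpus), cleaned, translate(cleaned)))
    return corpus


def to_jsonl(corpus: Iterable[Document]) -> str:
    """Serialize the corpus as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(asdict(doc), ensure_ascii=False) for doc in corpus)


def fake_translate(text: str) -> str:
    """Placeholder for an automatic translator; it only tags the text."""
    return "[es] " + text
```

Passing the translator as a function keeps the pipeline independent of any particular translation library, so the stub can be swapped for a real service without changing the other stages.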
Acknowledgments
This study is part of the research project “Incorporating Sustainability concepts to management models of textile Micro, Small and Medium Enterprises (SUMA)”, supported by the Flemish Interuniversity Council (VLIR) and the Research Vice-rector of the University of Cuenca (DIUC).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Santos, D., Auquilla, A., Siguenza-Guzman, L., Peña, M. (2021). A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models. In: Salgado Guerrero, J.P., Chicaiza Espinosa, J., Cerrada Lozada, M., Berrezueta-Guzman, S. (eds) Information and Communication Technologies. TICEC 2021. Communications in Computer and Information Science, vol 1456. Springer, Cham. https://doi.org/10.1007/978-3-030-89941-7_7
DOI: https://doi.org/10.1007/978-3-030-89941-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89940-0
Online ISBN: 978-3-030-89941-7
eBook Packages: Computer Science (R0)