A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models

Santos, David; Auquilla, Andrés; Siguenza-Guzman, Lorena; Peña, Mario

doi:10.1007/978-3-030-89941-7_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1456)

Included in the following conference series:

Conference on Information and Communication Technologies of Ecuador


Abstract

Machine Learning models are increasingly being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data is a key and basic component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework arises from the difficulty of finding inputs in languages other than English to feed NLP models. With a focus on producing a high-quality resource, the construction phases were designed along with the considerations that must be taken into account. The stages consist of characterizing the corpus to be obtained, collecting documents, cleaning, translation, storage, and evaluation. The proposed approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with free libraries. Finally, a case study was developed, resulting in a Spanish corpus of more than 170,000 documents within a specific domain, i.e., opinions on textile products. The evaluations carried out indicate that the proposed framework can build a large-scale, high-quality corpus.
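The stages listed in the abstract can be illustrated with a minimal Python sketch. Only the stage names (collecting, cleaning, translation, storage) come from the abstract; the cleaning rules, the `min_words` threshold, the JSON Lines storage format, and the `toy_translate` stand-in are assumptions made for illustration and are not the authors' implementation. The evaluation stage is left out because the abstract does not say how it was carried out.

```python
import html
import json
import re
from typing import Callable, Iterable, List


def clean(text: str) -> str:
    """Cleaning stage (assumed rules): decode HTML entities, drop tags, collapse whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(
    raw_docs: Iterable[str],
    translate: Callable[[str], str],
    min_words: int = 3,  # hypothetical threshold for discarding very short documents
) -> List[dict]:
    """Clean each collected document, skip short or duplicate ones, then translate it."""
    corpus = []
    seen = set()
    for raw in raw_docs:
        source = clean(raw)
        key = source.lower()
        if len(source.split()) < min_words or key in seen:
            continue
        seen.add(key)
        corpus.append({"source": source, "target": translate(source)})
    return corpus


def store(corpus: List[dict], path: str) -> None:
    """Storage stage (assumed format): one JSON record per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in corpus:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def toy_translate(text: str) -> str:
    """Hypothetical placeholder; a real pipeline would call an automatic translator here."""
    return "[es] " + text


if __name__ == "__main__":
    collected = [
        "<p>Great   shirt, fits &amp; feels nice</p>",
        "ok",
        "Great shirt, fits & feels nice",
    ]
    for record in build_corpus(collected, toy_translate):
        print(record)
```

Passing the translator in as a function keeps the sketch independent of any particular translation library, so a free translation package could be swapped in without touching the other stages.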



Acknowledgments

This study is part of the research project “Incorporating Sustainability concepts to management models of textile Micro, Small and Medium Enterprises (SUMA)”, supported by the Flemish Interuniversity Council (VLIR) and the Research Vice-rector of the University of Cuenca (DIUC).

Author information

Authors and Affiliations

Faculty of Engineering, University of Cuenca, 010107, Cuenca, Ecuador
David Santos
Department of Computer Sciences, Faculty of Engineering, University of Cuenca, 010107, Cuenca, Ecuador
Andrés Auquilla & Lorena Siguenza-Guzman
Research Centre Accountancy, Faculty of Economics and Business, KU Leuven, Leuven, Belgium
Lorena Siguenza-Guzman & Mario Peña
Research Department (DIUC), University of Cuenca, 010107, Cuenca, Ecuador
David Santos, Andrés Auquilla, Lorena Siguenza-Guzman & Mario Peña


Corresponding author

Correspondence to David Santos.

Editor information

Editors and Affiliations

Politecnica Salesiana University, Cuenca, Ecuador
Juan Pablo Salgado Guerrero
Universidad Técnica Particular de Loja, Loja, Ecuador
Janneth Chicaiza Espinosa
Politecnica Salesiana University, Cuenca, Ecuador
Mariela Cerrada Lozada
CEDIA, Cuenca, Ecuador
Santiago Berrezueta-Guzman

A Annexes

1.1 Annex A

https://imagineresearch.org/wp-content/uploads/2021/08/Corpus_AnnexA.pdf

1.2 Annex B

https://imagineresearch.org/wp-content/uploads/2021/08/Corpus_AnnexB.pdf


Cite this paper

Santos, D., Auquilla, A., Siguenza-Guzman, L., Peña, M. (2021). A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models. In: Salgado Guerrero, J.P., Chicaiza Espinosa, J., Cerrada Lozada, M., Berrezueta-Guzman, S. (eds) Information and Communication Technologies. TICEC 2021. Communications in Computer and Information Science, vol 1456. Springer, Cham. https://doi.org/10.1007/978-3-030-89941-7_7


DOI: https://doi.org/10.1007/978-3-030-89941-7_7
Published: 23 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89940-0
Online ISBN: 978-3-030-89941-7
eBook Packages: Computer Science, Computer Science (R0)
