A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models

  • Conference paper
Information and Communication Technologies (TICEC 2021)

Abstract

Machine Learning models are rapidly being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data is a basic and key component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework arises from the difficulty of finding inputs in languages other than English. With a focus on producing a high-quality resource, the construction phases were designed together with the considerations that each one requires. The stages are: characterization of the corpus to be obtained, document collection, cleaning, translation, storage, and evaluation. The proposed approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with free libraries. Finally, a case study produced a Spanish corpus of more than 170,000 documents within a specific domain, namely opinions on textile products. The evaluations carried out indicate that the proposed framework can build a large-scale, high-quality corpus.
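The stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`clean`, `build_corpus`, `to_jsonl`, `fake_translate`), the minimum-length rule, and the JSON Lines storage format are all assumptions made for illustration. The translator is a placeholder that only tags the text, and the evaluation stage is omitted.

```python
import json
import re
from typing import Callable, Iterable, List


def clean(text: str) -> str:
    """Strip HTML-like tags and collapse whitespace (hypothetical cleaning rules)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(
    documents: Iterable[str],
    translate: Callable[[str], str],
    min_words: int = 3,
) -> List[dict]:
    """Clean, filter, deduplicate, and translate raw English documents."""
    seen = set()
    corpus = []
    for raw in documents:
        source = clean(raw)
        key = source.lower()
        # Characterization is reduced here to a minimum-length rule (assumption).
        if len(source.split()) < min_words or key in seen:
            continue
        seen.add(key)
        corpus.append({"source": source, "target": translate(source)})
    return corpus


def to_jsonl(corpus: List[dict]) -> str:
    """Storage step: serialize the corpus as JSON Lines, one document per line."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in corpus)


def fake_translate(text: str) -> str:
    """Placeholder translator; a real run would call a machine-translation library."""
    return "[es] " + text


# Collection step, simulated with a few hard-coded product reviews.
reviews = [
    "<p>This  shirt fits well</p>",
    "this shirt fits well",
    "Too small",
    "The fabric feels soft and warm",
]
corpus = build_corpus(reviews, fake_translate)
print(to_jsonl(corpus))
```

Passing the translator as a function argument keeps the pipeline independent of any particular translation library, so the placeholder can be swapped for a real translator without touching the other stages.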

Acknowledgments

This study is part of the research project “Incorporating Sustainability concepts to management models of textile Micro, Small and Medium Enterprises (SUMA)”, supported by the Flemish Interuniversity Council (VLIR) and the Research Vice-rector of the University of Cuenca (DIUC).

Author information

Corresponding author

Correspondence to David Santos.

A Annexes

1.1 Annex A

https://imagineresearch.org/wp-content/uploads/2021/08/Corpus_AnnexA.pdf

1.2 Annex B

https://imagineresearch.org/wp-content/uploads/2021/08/Corpus_AnnexB.pdf

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Santos, D., Auquilla, A., Siguenza-Guzman, L., Peña, M. (2021). A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models. In: Salgado Guerrero, J.P., Chicaiza Espinosa, J., Cerrada Lozada, M., Berrezueta-Guzman, S. (eds) Information and Communication Technologies. TICEC 2021. Communications in Computer and Information Science, vol 1456. Springer, Cham. https://doi.org/10.1007/978-3-030-89941-7_7

  • DOI: https://doi.org/10.1007/978-3-030-89941-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89940-0

  • Online ISBN: 978-3-030-89941-7

  • eBook Packages: Computer Science, Computer Science (R0)
