Extraction and Processing of Web Content for Corpus Creation: A Systematic Literature Review

New Perspectives in Software Engineering

Abstract

The processes and methods used to extract and pre-process text for corpus generation are not widely documented, especially for Spanish-language texts. Most of the documents that collect this information are in English and focus on research carried out in the United States or in Asian countries. The aim of this systematic literature review is to establish the state of the art of the technologies and methods used to extract text from web platforms and pre-process it to generate a specific corpus. The review addresses the issues defined by the research questions and identifies an area of opportunity for new projects focused on extracting web information and creating corpora for analysis in the fields of text mining and natural language processing.

Author information

Corresponding author

Correspondence to Miguel Hidalgo Reyes.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Luna, J.A.F., Reyes, M.H., Barradas, V.L. (2024). Extraction and Processing of Web Content for Corpus Creation: A Systematic Literature Review. In: Mejía, J., Muñoz, M., Rocha, A., Hernández Pérez, Y., Avila-George, H. (eds) New Perspectives in Software Engineering. Studies in Computational Intelligence, vol 1135. Springer, Cham. https://doi.org/10.1007/978-3-031-50590-4_9
