Extraction and Processing of Web Content for Corpus Creation: A Systematic Literature Review

Luna, Jair Alfredo Flores; Reyes, Miguel Hidalgo; Barradas, Virginia Lagunes

doi:10.1007/978-3-031-50590-4_9

Part of the book series: Studies in Computational Intelligence (SCI, volume 1135)

Abstract

The processes and methods of text extraction and pre-processing for corpus generation are not widely documented, especially for Spanish texts. Most of the documents that collect this information are in English and focus on research carried out in the United States or in Asian countries. The aim of this systematic literature review is to establish the state of the art of the technologies and methods used to extract text from web platforms and to pre-process it in order to generate a specific corpus. The review addresses the issues defined by the research questions and identifies an area of opportunity for new projects focused on extracting web information and creating corpora for analysis in the fields of text mining and natural language processing.

Author information

Authors and Affiliations

Tecnológico Nacional de México/Instituto Tecnológico Superior de Xalapa, C.P. 91096, Xalapa, México
Jair Alfredo Flores Luna, Miguel Hidalgo Reyes & Virginia Lagunes Barradas

Corresponding author

Correspondence to Miguel Hidalgo Reyes.

Editor information

Editors and Affiliations

Quantum Park Zacatecas, Zacatecas, Mexico
Jezreel Mejía
Quantum Park Zacatecas, Zacatecas, Mexico
Mirna Muñoz
Lisbon, Portugal
Alvaro Rocha
Interior Internado Palmira S/N, Cuernavaca, Morelos, Mexico
Yasmin Hernández Pérez
Ameca, Jalisco, Mexico
Himer Avila-George

About this chapter

Cite this chapter

Luna, J.A.F., Reyes, M.H., Barradas, V.L. (2024). Extraction and Processing of Web Content for Corpus Creation: A Systematic Literature Review. In: Mejía, J., Muñoz, M., Rocha, A., Hernández Pérez, Y., Avila-George, H. (eds) New Perspectives in Software Engineering. Studies in Computational Intelligence, vol 1135. Springer, Cham. https://doi.org/10.1007/978-3-031-50590-4_9

DOI: https://doi.org/10.1007/978-3-031-50590-4_9
Published: 21 February 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-50589-8
Online ISBN: 978-3-031-50590-4
eBook Packages: Intelligent Technologies and Robotics (R0)
