Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications

Chapter in: Special Topics in Multimedia, IoT and Web Technologies

Abstract

Data from the Web are increasingly heterogeneous and unstructured, posing challenges for data crawling, integration, and preprocessing. Many studies are “data oriented,” i.e., they are designed to address a problem raised by the data already available, so their results are restricted to those data. In contrast, various problems arise before identifying what data are needed for a specific study, and often multiple data sources are needed. This chapter covers such problems with definitions, current solutions, possible issues, and future work. Specifically, the first issue in dealing with data coming from the Web is to define the crawling strategy, which can be classified according to its period and how it is started. The second issue is to define a strategy for integrating data from different sources so that users or applications have a uniform view, and for storing them in a way that allows efficient querying. One option is to collect data from each source and store them separately for later integration; another is to store all data in a single location, integrating them as each collection is performed. The third issue is data preprocessing, which takes place before or after data integration and involves handling missing and duplicate data, normalization, data veracity, etc. Overall, this chapter addresses these three issues in an integrated way, with a focus on practical and research questions.
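As a rough illustration of the second and third issues, the sketch below maps records from two sources onto one uniform schema, removes duplicates, fills a missing value, and normalizes a numeric field. It is a minimal sketch and not the chapter's method: the records, the field names, the mappings, the choice of e-mail as duplicate key, and mean imputation with min-max scaling are all assumptions made for the example.

```python
# Hypothetical records from two Web sources with different schemas.
source_a = [
    {"name": "Ana Silva ", "email": "ANA@example.org", "commits": 40},
    {"name": "Bruno Lima", "email": "bruno@example.org", "commits": None},
]
source_b = [
    {"full_name": "ana silva", "mail": "ana@example.org", "n_commits": 40},
    {"full_name": "Carla Dias", "mail": "carla@example.org", "n_commits": 10},
]

# Hypothetical mappings from source-specific fields to the shared schema.
MAP_A = {"name": "name", "email": "email", "commits": "commits"}
MAP_B = {"full_name": "name", "mail": "email", "n_commits": "commits"}


def to_uniform(record, mapping):
    """Rename source-specific fields to the shared schema."""
    return {target: record.get(src) for src, target in mapping.items()}


def clean(record):
    """Normalize text fields so that equal entities compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "commits": record["commits"],
    }


def integrate(sources):
    """Merge mapped sources, using the e-mail as the duplicate key (assumed)."""
    merged = {}
    for records, mapping in sources:
        for raw in records:
            rec = clean(to_uniform(raw, mapping))
            merged.setdefault(rec["email"], rec)  # keep the first occurrence
    return list(merged.values())


def fill_and_scale(records):
    """Replace missing counts with the mean, then min-max scale to [0, 1]."""
    known = [r["commits"] for r in records if r["commits"] is not None]
    mean = sum(known) / len(known)
    for r in records:
        if r["commits"] is None:
            r["commits"] = mean
    lo = min(r["commits"] for r in records)
    hi = max(r["commits"] for r in records)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    for r in records:
        r["commits_scaled"] = (r["commits"] - lo) / span
    return records


unified = fill_and_scale(integrate([(source_a, MAP_A), (source_b, MAP_B)]))
for row in unified:
    print(row)
```

Here the sources are integrated as they are read, which corresponds to the single-location option; storing each source separately and running `integrate` later would correspond to the other option.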

Notes

  1. Open Knowledge Foundation: https://okfn.org/.
  2. Brazilian Open Data Portal: http://dados.gov.br.
  3. data.gov: https://www.data.gov.
  4. European Data Portal: http://europeandataportal.eu.
  5. CKAN: https://ckan.org.
  6. Socrata: https://socrata.com.
  7. GeoServer: http://geoserver.org.
  8. MapServer: https://mapserver.org.
  9. OGC: http://www.opengeospatial.org.
  10. FOAF: http://xmlns.com/foaf/spec/.
  11. W3C: https://www.w3.org/standards/semanticWeb/ontology.
  12. https://www.w3.org/TR/rdf11-primer/#section-graph-syntax.
  13. LOD Cloud: https://lod-cloud.net/.
  14. HTML Microdata: https://www.w3.org/TR/microdata/.
  15. Plain Old Semantic HTML: http://microformats.org/wiki/posh.
  16. Microformats wiki: http://microformats.org/wiki/Main_Page.
  17. Web Services Architecture: https://www.w3.org/TR/ws-arch/.
  18. Difference between ETL and data integration: https://www.passionned.com/is-data-integration-becoming-the-new-etl/. Accessed on February 4, 2019.
  19. In graph theory, a clique is a set of vertices in a graph where each vertex is connected to all others by an edge, so it is a fully connected graph.
  20. Density is the ratio between the number of edges in the graph and the maximal number of edges.
  21. Project Apoena: http://bit.ly/proj-apoena.
  22. Lab CSX: http://www.labcsx.dcc.ufmg.br.
  23. Piim-Lab: http://piim-lab.decom.cefetmg.br.
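The definitions of clique and density given in the notes can be made concrete with a small sketch. It assumes an undirected simple graph (no self-loops, no parallel edges), so the maximal number of edges for n vertices is n(n-1)/2; the function names and the example graphs are hypothetical.

```python
from itertools import combinations


def density(n_vertices, edges):
    """Ratio of existing edges to the maximal number of edges, n(n-1)/2."""
    max_edges = n_vertices * (n_vertices - 1) // 2
    unique_edges = {frozenset(e) for e in edges}  # ignore edge direction
    return len(unique_edges) / max_edges if max_edges else 0.0


def is_clique(vertices, edges):
    """True when every pair of the given vertices is joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(vertices, 2))


triangle = [(1, 2), (2, 3), (1, 3)]
path = [(1, 2), (2, 3)]

print(density(3, triangle), is_clique([1, 2, 3], triangle))
print(density(3, path), is_clique([1, 2, 3], path))
```

Under these assumptions a clique is exactly a vertex set whose induced subgraph has density 1.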

Acknowledgements

The research that resulted in the writing of this chapter was funded by CAPES, CNPq and FAPEMIG.

Author information

Corresponding author

Correspondence to Michele A. Brandão.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Batista, N.A., Brandão, M.A., Pinheiro, M.B., Dalip, D.H., Moro, M.M. (2020). Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications. In: Roesler, V., Barrére, E., Willrich, R. (eds) Special Topics in Multimedia, IoT and Web Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-35102-1_8

  • DOI: https://doi.org/10.1007/978-3-030-35102-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35101-4

  • Online ISBN: 978-3-030-35102-1

  • eBook Packages: Computer Science, Computer Science (R0)
