Multiple Types of Semi-structured Data Extraction Using Wrapper for Extraction of Image Using DOM (WEID)

  • Conference paper
Regional Conference on Science, Technology and Social Sciences (RCSTSS 2016)

Abstract

The amount of data available electronically has increased dramatically in recent years. These data exist in different forms: structured (SD), semi-structured (SSD), or unstructured (USD). Data integration for multiple types of data can be defined as the problem of combining data from heterogeneous sources into one unified structure, so that users are given a unified view of the data. Without such integration, a user is unable to view the data as a single entity irrespective of their origin or type. In this paper, we propose a diagrammatic representation of a wrapper for extracting multiple types of SSD using the Document Object Model (DOM). We have implemented an automated web extractor, Wrapper for Extraction of Image using DOM (WEID), in the PHP programming language; it extracts images from a web page. Experimental results on a web page are encouraging and confirm the feasibility of the approach for extracting images. Our approach is less labour-intensive, and we believe that it allows images to be extracted automatically, quickly, and effectively.
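The idea of DOM-based image extraction described in the abstract can be illustrated with a minimal sketch. It is not the authors' PHP implementation of WEID, whose details are not given on this page; it is written in Python with the standard-library `xml.dom.minidom` module, and it assumes the page is well-formed XHTML (real-world HTML would normally need a more tolerant parser). The function name `extract_image_urls`, the sample markup, and the `example.com` URLs are hypothetical and serve only as illustration.

```python
from urllib.parse import urljoin
from xml.dom import minidom


def extract_image_urls(markup: str, base_url: str) -> list[str]:
    """Return absolute URLs of all <img> elements found in the DOM tree."""
    # Assumption: the markup is well-formed XHTML, so minidom can parse it.
    document = minidom.parseString(markup)
    urls = []
    # getElementsByTagName walks the tree and returns nodes in document order.
    for img in document.getElementsByTagName("img"):
        src = img.getAttribute("src")  # empty string when the attribute is absent
        if src:
            # Resolve relative paths against the address of the page.
            urls.append(urljoin(base_url, src))
    return urls


# Hypothetical page content, used only for illustration.
SAMPLE_PAGE = """<html><body>
  <h1>Gallery</h1>
  <img src="/images/crab.png" alt="Mud crab"/>
  <p>Some text <img src="photos/wetland.jpg" alt="Wetland"/></p>
  <img alt="no source"/>
</body></html>"""

image_urls = extract_image_urls(SAMPLE_PAGE, "http://example.com/site/index.html")
print(image_urls)
```

In this sketch the wrapper only collects the image addresses; downloading or storing the images would be a separate step. Images that are nested inside other elements are still found, because the lookup works on the whole tree rather than on the raw text of the page.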

Acknowledgements

This research is supported by the Scholarship of Universiti Malaysia Terengganu (BUMT).

Corresponding author

Correspondence to Ily Amalina Sabri Ahmad.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Sabri Ahmad, I.A., Man, M. (2018). Multiple Types of Semi-structured Data Extraction Using Wrapper for Extraction of Image Using DOM (WEID). In: Yacob, N., Mohd Noor, N., Mohd Yunus, N., Lob Yussof, R., Zakaria, S. (eds) Regional Conference on Science, Technology and Social Sciences (RCSTSS 2016). Springer, Singapore. https://doi.org/10.1007/978-981-13-0074-5_6