Abstract
The amount of data of all kinds that is available electronically has increased dramatically in recent years. The data exist in different forms: structured data (SD), semi-structured data (SSD) and unstructured data (USD). Data integration across multiple types of data can be defined as the problem of combining data from heterogeneous sources into one unified structure. Without integration, a user is unable to view the data as a single entity irrespective of its origin or data type. Integration involves combining data from different sources and providing users with a unified view of these data. In this paper, we propose a diagrammatic representation of a wrapper for extracting multiple types of SSD using the Document Object Model (DOM). We have implemented an automated web extractor, Wrapper for Extraction of Image using DOM (WEID), in the PHP programming language; it extracts images from a web page. Experimental results on a web page are encouraging and confirm the feasibility of the approach for extracting images. Our approach is less labour-intensive, and we believe that it allows images to be extracted automatically, quickly and effectively.
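To make the idea of DOM-based image extraction concrete, the following is a minimal sketch only; it is not the WEID implementation, which was written in PHP and is not reproduced here. It uses Python's standard xml.dom.minidom module and assumes the page is well-formed XHTML (real pages often are not and would need a more tolerant parser). The function name extract_image_urls, the sample page and the example.org base URL are all hypothetical.

```python
from urllib.parse import urljoin
from xml.dom import minidom


def extract_image_urls(xhtml, base_url):
    """Return absolute URLs of all <img> elements, in document order.

    Assumption: `xhtml` is well-formed; minidom raises an ExpatError otherwise.
    """
    dom = minidom.parseString(xhtml)
    urls = []
    # Walk the DOM tree and collect every <img> node.
    for img in dom.getElementsByTagName("img"):
        src = img.getAttribute("src")  # empty string if the attribute is absent
        if src:
            # Resolve relative paths against the page's base URL.
            urls.append(urljoin(base_url, src))
    return urls


# Hypothetical sample page with two usable images and one without a source.
page = """<html><body>
  <h1>Gallery</h1>
  <img src="/images/crab.png" alt="Mud crab"/>
  <p>Text <img src="wetland.jpg" alt="Wetland"/></p>
  <img alt="no source"/>
</body></html>"""

print(extract_image_urls(page, "http://example.org/gallery/"))
```

The sketch only locates image references; downloading the files and handling other SSD types would be separate steps.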
Acknowledgements
This research is supported by the Scholarship of Universiti Malaysia Terengganu (BUMT).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sabri Ahmad, I.A., Man, M. (2018). Multiple Types of Semi-structured Data Extraction Using Wrapper for Extraction of Image Using DOM (WEID). In: Yacob, N., Mohd Noor, N., Mohd Yunus, N., Lob Yussof, R., Zakaria, S. (eds) Regional Conference on Science, Technology and Social Sciences (RCSTSS 2016). Springer, Singapore. https://doi.org/10.1007/978-981-13-0074-5_6
DOI: https://doi.org/10.1007/978-981-13-0074-5_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0073-8
Online ISBN: 978-981-13-0074-5
eBook Packages: Business and Management (R0)