Abstract
The Web contains a huge volume of information supplied by diverse sources such as e-commerce sites, electronic directories, and search engines. Automating information extraction (IE) from these sources is difficult because they were designed for human access (manual navigation), and the difficulty grows with the number of sources involved. In this paper, we address the problem of IE from several sources. A first approach to this problem is to design a new IE method and apply it to all the sources. This approach is not very successful and is difficult to implement, especially when the sources are very heterogeneous. We therefore propose a more effective alternative that benefits from existing methods and tools by applying to each source the tool that suits it best. For that purpose, we exploit a domain ontology to deduce the tool appropriate to each source. We present the WebOMSIE system, an ontology-based framework for multi-source information extraction from the Web.
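The idea of matching each source to its most suitable tool can be illustrated with a minimal sketch. This is not the WebOMSIE implementation: the abstract does not describe the ontology or the reasoning procedure, so the tool names, the feature labels, and the "most specific applicable tool wins" rule below are all hypothetical assumptions, and a plain in-memory list stands in for the domain ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    features: frozenset  # observed characteristics of the source (hypothetical labels)


@dataclass(frozen=True)
class Tool:
    name: str
    requires: frozenset  # features a source must have for the tool to apply


# Hypothetical stand-in for the knowledge a domain ontology would hold.
TOOLS = [
    Tool("template_wrapper", frozenset({"html", "template_generated"})),
    Tool("table_extractor", frozenset({"html", "tabular"})),
    Tool("free_text_extractor", frozenset({"free_text"})),
]


def select_tool(source, tools=TOOLS):
    """Return the applicable tool with the most required features, or None."""
    # A tool applies when all of its required features are present in the source.
    applicable = [t for t in tools if t.requires <= source.features]
    if not applicable:
        return None
    # Assumed tie-breaking rule: prefer the most specific tool.
    return max(applicable, key=lambda t: len(t.requires))


def assign_tools(sources, tools=TOOLS):
    """Map each source name to the name of its selected tool (None if no tool applies)."""
    result = {}
    for s in sources:
        tool = select_tool(s, tools)
        result[s.name] = tool.name if tool else None
    return result


if __name__ == "__main__":
    sources = [
        Source("directory", frozenset({"html", "tabular"})),
        Source("reviews", frozenset({"free_text"})),
        Source("scans", frozenset({"image"})),
    ]
    print(assign_tools(sources))
```

In an ontology-based setting, the subset test would presumably be replaced by a reasoner query over source and tool descriptions; the sketch only shows the per-source selection step.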
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Younsi, Z., Quafafou, M., Ouzegane, R., Tari, A. (2013). WebOMSIE: An Ontology-Based Multi Source Web Information Extraction. In: Pechenizkiy, M., Wojciechowski, M. (eds) New Trends in Databases and Information Systems. Advances in Intelligent Systems and Computing, vol 185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32518-2_19
DOI: https://doi.org/10.1007/978-3-642-32518-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32517-5
Online ISBN: 978-3-642-32518-2
eBook Packages: Engineering, Engineering (R0)